ResearchHub | Open Science Community

EIE

Song Han et al.Jun 18, 2016

State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power. Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120× energy saving; Exploiting sparsity saves 10×; Weight sharing gives 8×; Skipping zero activations from ReLU saves another 3×. Evaluated on nine DNN benchmarks, EIE is 189× and 13× faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS working directly on a compressed network, corresponding to 3 TOPS on an uncompressed network, and processes FC layers of AlexNet at 1.88×10 4 frames/sec with a power dissipation of only 600mW. It is 24,000× and 3,400× more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9×, 19× and 3× better throughput, energy efficiency and area efficiency.

Artificial Intelligence

Electrical And Electronic Engineering

0

Paper

Artificial Intelligence

1,955

0

Save

0

The future of wires

Richard Lee et al.Apr 1, 2001

Concern about the performance of wires wires in scaled technologies has led to research exploring other communication methods. This paper examines wire and gate delays as technologies migrate from 0.18-/spl mu/m to 0.035-/spl mu/m feature sizes to better understand the magnitude of the the wiring problem. Wires that shorten in length as technologies scale have delays that either track gate delays or grow slowly relative to gate delays. This result is good news since these "local" wires dominate chip wiring. Despite this scaling of local wire performance, computer-aided design (CAD) tools must still become move sophisticated in dealing with these wires. Under scaling, the total number of wires grows exponentially, so CAD tools will need to handle an ever-growing percentage of all the wires in order to keep designer workloads constant. Global wires present a more serious problem to designers. These are wires that do not scale in length since they communicate signals across the chip. The delay of these wives will remain constant if repeaters are used meaning that relative to gate delays, their delays scale upwards. These increased delays for global communication will drive architectures toward modular designs with explicit global latency mechanisms.

Electrical And Electronic Engineering

Computer Science

0

Paper

Electrical And Electronic Engineering

1,438

0

Save

0

1.1 Computing's energy problem (and what we can do about it)

Mark HorowitzFeb 1, 2014

Our challenge is clear: The drive for performance and the end of voltage scaling have made power, and not the number of transistors, the principal factor limiting further improvements in computing performance. Continuing to scale compute performance will require the creation and effective use of new specialized compute engines, and will require the participation of application experts to be successful. If we play our cards right, and develop the tools that allow our customers to become part of the design process, we will create a new wave of innovative and efficient computing devices.

Mechanical Engineering

Electrical And Electronic Engineering

0

Paper

Mechanical Engineering

1,378

0

Save

0

The Stanford Dash multiprocessor

Daniel Lenoski et al.Mar 1, 1992

The overall goals and major features of the directory architecture for shared memory (Dash) are presented. The fundamental premise behind the architecture is that it is possible to build a scalable high-performance machine with a single address space and coherent caches. The Dash architecture is scalable in that it achieves linear or near-linear performance growth as the number of processors increases from a few to a few thousand. This performance results from distributing the memory among processing nodes and using a network with scalable bandwidth to connect the nodes. The architecture allows shared data to be cached, significantly reducing the latency of memory accesses and yielding higher processor utilization and higher overall performance. A distributed directory-based protocol that provides cache coherence without compromising scalability is discussed in detail. The Dash prototype machine and the corresponding software support are described.< >

Architecture

Computer Networks And Communications

0

Paper

Save

High performance imaging using large camera arrays

Bennett Wilburn et al.Jul 1, 2005

The advent of inexpensive digital image sensors and the ability to create photographs that combine information from a number of sensed images are changing the way we think about photography. In this paper, we describe a unique array of 100 custom video cameras that we have built, and we summarize our experiences using this array in a range of imaging applications. Our goal was to explore the capabilities of a system that would be inexpensive to produce in the future. With this in mind, we used simple cameras, lenses, and mountings, and we assumed that processing large numbers of images would eventually be easy and cheap. The applications we have explored include approximating a conventional single center of projection video camera with high performance along one or more axes, such as resolution, dynamic range, frame rate, and/or large aperture, and using multiple cameras to approximate a video camera with a large synthetic aperture. This permits us to capture a video light field, to which we can apply spatiotemporal view interpolation algorithms in order to digitally simulate time dilation and camera motion. It also permits us to create video sequences using custom non-uniform synthetic apertures.

Artificial Intelligence

Media Technology

0

Paper

Artificial Intelligence

975

0

Save

0

Signal Delay in RC Tree Networks

J. Rubinstein et al.Jul 1, 1983

In MOS integrated circuits, signals may propagate between stages with fanout. The exact calculation of signal delay through such networks is difficult. However, upper and lower bounds for delay that are computationally simple are presented in this paper. The results can be used 1) to bound the delay, given the signal threshold, or 2) to bound the signal voltage, given a delay time, or 3) certify that a circuit is "fast enough," given both the maximum delay and the voltage threshold.

Philosophy

Artificial Intelligence

0

Paper

Save

Forwarding metamorphosis

Pat Bosshart et al.Aug 13, 2013

In Software Defined Networking (SDN) the control plane is physically separate from the forwarding plane. Control software programs the forwarding plane (e.g., switches and routers) using an open interface, such as OpenFlow. This paper aims to overcomes two limitations in current switching chips and the OpenFlow protocol: i) current hardware switches are quite rigid, allowing ``Match-Action'' processing on only a fixed set of fields, and ii) the OpenFlow specification only defines a limited repertoire of packet processing actions. We propose the RMT (reconfigurable match tables) model, a new RISC-inspired pipelined architecture for switching chips, and we identify the essential minimal set of action primitives to specify how headers are processed in hardware. RMT allows the forwarding plane to be changed in the field without modifying hardware. As in OpenFlow, the programmer can specify multiple match tables of arbitrary width and depth, subject only to an overall resource limit, with each table configurable for matching on arbitrary fields. However, RMT allows the programmer to modify all header fields much more comprehensively than in OpenFlow. Our paper describes the design of a 64 port by 10 Gb/s switch chip implementing the RMT model. Our concrete design demonstrates, contrary to concerns within the community, that flexible OpenFlow hardware switch implementations are feasible at almost no additional cost or power.

Computer Networks And Communications

Hardware And Architecture

0

Paper

Computer Networks And Communications

749

0

Save

0

The Stanford FLASH multiprocessor

Jeffrey Kuskin et al.Apr 1, 1994

The FLASH multiprocessor efficiently integrates support for cache-coherent shared memory and high-performance message passing, while minimizing both hardware and software overhead. Each node in FLASH contains a microprocessor, a portion of the machine's global memory, a port to the interconnection network, an I/O interface, and a custom node controller called MAGIC. The MAGIC chip handles all communication both within the node and among nodes, using hardwired data paths for efficient data movement and a programmable processor optimized for executing protocol operations. The use of the protocol processor makes FLASH very flexible --- it can support a variety of different communication mechanisms --- and simplifies the design and implementation.This paper presents the architecture of FLASH and MAGIC, and discusses the base cache-coherence and message-passing protocols. Latency and occupancy numbers, which are derived from our system-level simulator and our Verilog code, are given for several common protocol operations. The paper also describes our software strategy and FLASH's current status.

Computer Networks And Communications

Hardware And Architecture

0

Paper

Computer Networks And Communications

680

0

Save

0

Light field microscopy

Marc Levoy et al.Jul 1, 2006

By inserting a microlens array into the optical train of a conventional microscope, one can capture light fields of biological specimens in a single photograph. Although diffraction places a limit on the product of spatial and angular resolution in these light fields, we can nevertheless produce useful perspective views and focal stacks from them. Since microscopes are inherently orthographic devices, perspective views represent a new way to look at microscopic specimens. The ability to create focal stacks from a single photograph allows moving or light-sensitive specimens to be recorded. Applying 3D deconvolution to these focal stacks, we can produce a set of cross sections, which can be visualized using volume rendering. In this paper, we demonstrate a prototype light field microscope (LFM), analyze its optical performance, and show perspective views, focal stacks, and reconstructed volumes for a variety of biological specimens. We also show that synthetic focusing followed by 3D deconvolution is equivalent to applying limited-angle tomography directly to the 4D light field.

Biophysics

Atomic And Molecular Physics, And Optics

0

Paper

Save

Energy dissipation in general purpose microprocessors

R. Gonzalez et al.Jan 1, 1996

In this paper we investigate possible ways to improve the energy efficiency of a general purpose microprocessor. We show that the energy of a processor depends on its performance, so we chose the energy-delay product to compare different processors. To improve the energy-delay product we explore methods of reducing energy consumption that do not lead to performance loss (i.e. wasted energy), and explore methods to reduce delay by exploiting instruction level parallelism. We found that careful design reduced the energy dissipation by almost 25%. Pipelining can give approximately a 2/spl times/ improvement in energy-delay product. Superscalar issue, however, does not improve the energy-delay product any further since the overhead required offsets the gains in performance. Further improvements will be hard to come by since a large fraction of the energy (50-80%) is dissipated in the clock network and the on-chip memories. Thus, the efficiency of processors will depend more on the technology being used and the algorithm chosen by the programmer than the micro-architecture.

Electrical And Electronic Engineering

Hardware And Architecture

0

Paper

Electrical And Electronic Engineering

635

0

Save