# Clock Distribution Networks for 3-D Integrated Circuits

Vasilis F. Pavlidis, Ioannis Savidis, and Eby G. Friedman

Department of Electrical and Computer Engineering, University of Rochester, Rochester, New York 14627

{pavlidis, iosavid, friedman}@ece.rochester.edu

*Abstract* - Three-dimensional (3-D) integration is an important technology that addresses fundamental limitations of on-chip interconnects. Several design issues related to 3-D circuits, such as multi-plane synchronization, however, need to be addressed. A comparison of three 3-D clock distribution network topologies is presented in this paper. Experimental results of a 3-D test circuit manufactured by the MIT Lincoln Laboratories are also described. Successful operation of the 3-D test circuit at 1.4 GHz is demonstrated. Clock skew and power dissipation measurements for the different clock topologies are also provided.

# I. INTRODUCTION

An omnipresent and challenging issue for synchronous digital circuits is the reliable distribution of the clock signal to the many thousands of sequential elements distributed throughout a synchronous circuit [1], [2]. The complexity of this task is further increased in 3-D ICs as sequential elements belonging to the same clock domain (*i.e.*, synchronized by the same clock signal) can be located on multiple planes. Another important issue in the design of clock distribution networks is low power consumption, since the clock network dissipates a significant portion of the total power consumed by a synchronous circuit [3], [4]. This demand is stricter for 3-D ICs due to the increased power density and related thermal limitations.

In 2-D circuits, symmetric interconnect structures, such as H- and X-trees, are widely utilized to distribute the clock signal across a circuit [2]. The symmetry of these structures permits the clock signal to arrive at the leaves of the tree at the same time, resulting in synchronous data processing. Maintaining this symmetry within a 3-D circuit, however, is a difficult task.
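The symmetry argument can be made concrete with a small sketch. The code below builds an idealized binary H-tree and records the wire length from the root to every leaf. The function name `h_tree_leaves` and the normalized segment lengths are illustrative assumptions and do not describe the layout of the test circuit.

```python
def h_tree_leaves(levels, x=0.0, y=0.0, half=1.0, horizontal=True, dist=0.0):
    """Return (x, y, path_length) for each leaf of an idealized binary H-tree.

    Each level splits the clock into two branches. Branch directions alternate
    between horizontal and vertical, and the segment length is halved after
    every vertical split (an assumed, normalized geometry).
    """
    if levels == 0:
        return [(x, y, dist)]
    leaves = []
    for sign in (-1.0, 1.0):
        nx = x + sign * half if horizontal else x
        ny = y if horizontal else y + sign * half
        next_half = half if horizontal else half / 2.0
        leaves += h_tree_leaves(levels - 1, nx, ny, next_half,
                                not horizontal, dist + half)
    return leaves


if __name__ == "__main__":
    leaves = h_tree_leaves(4)  # four levels of binary branching
    lengths = {length for _, _, length in leaves}
    print(len(leaves), "leaves, distinct root-to-leaf lengths:", lengths)
```

In this idealized planar tree every leaf lies at the same wire distance from the root, which is the property that is hard to preserve once some branches must cross to another plane.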

An extension of an H-tree to three dimensions does not guarantee equidistant interconnect paths from the root to the leaves of the tree. The clock signal propagates through vertical interconnects, typically implemented by through silicon vias (TSVs), from the output of the clock driver to the center of the H-tree on the other planes. The impedance of the TSVs can increase the time for the clock signal to arrive at the leaves of the tree on these planes, as compared to the arrival time at the leaves of the tree located on the same plane as the clock driver. Furthermore, in a multi-plane 3-D circuit, three or four branches can emanate at each branch point. The third and fourth branches propagate the clock signal to the other planes of the 3-D circuit. Similar to a design methodology for a 2-D H-tree topology, the width of each branch is reduced by a third (or more) of the segment width preceding the branch point in order to match the impedance at that branch point. This requirement, however, is difficult to achieve as the third and fourth branches are connected by a TSV.
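The matching requirement at a branch point can be sketched as follows, assuming that each segment behaves as a transmission line with characteristic impedance $Z_k$ at level $k$ and that $n$ identical branches leave the branch point (the symbols $Z_k$ and $n$ are introduced here for illustration). Reflections at the branch point are suppressed when the parallel combination of the outgoing branches equals the impedance of the incoming segment,

$$
\frac{Z_{k+1}}{n} = Z_k \quad \Longrightarrow \quad Z_{k+1} = n \, Z_k .
$$

For a 2-D H-tree, $n = 2$. At a branch point of a multi-plane 3-D circuit, $n = 3$ or $n = 4$, so each outgoing branch requires a proportionally higher impedance. A planar wire can approach this target by narrowing the line, whereas the impedance of a TSV is largely set by the process, which is one way to read the difficulty noted above.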

Global signaling issues in 3-D circuits, such as clock signal distribution, are essentially unexplored. Recent papers consider thermal effects on buffered 3-D clock trees [5] and H-tree topologies [6], [7]. No experimental characterization of 3-D clock distribution networks, however, has been presented. Measurements from a 3-D test circuit fabricated by the MIT Lincoln Laboratories (MITLL) [8] employing several clock distribution architectures are presented for the first time in this paper.

In the following section, the design of the 3-D test circuit is described. A brief discussion of the MITLL process is provided in Section III. Experimental results and a discussion of the characteristics of the three clock distribution networks are presented in Section IV. Some conclusions are offered in Section V.

# II. DESIGN OF THE 3-D TEST CIRCUIT

The test circuit consists of three blocks. Each block includes the same logic circuit but implements a different clock distribution architecture. The total area of the test circuit is 3 mm  $\times$  3 mm, and each block occupies an approximate area of 1 mm<sup>2</sup>. Each block contains about 30,000 transistors with a power supply voltage of 1.5 volts. The design kit used for the implementation has been provided by North Carolina State University [10]. The common logic circuitry used in each clock module is described in Section II-A, and the different clock distribution architectures are reviewed in Section II-B.

## A. 3-D Circuit Architecture

The logic circuit common to the three blocks is described in this section. An overview of the logic circuitry is depicted in Fig. 1. The function of the logic is to emulate different switching patterns of the circuit and load conditions for the clock distribution networks under investigation. The logic is repeated in each plane and includes pseudorandom number generators (PNG), a six by six crossbar switch, control logic for the crossbar switch, several groups of four-bit counters and current loads, and RF output pads for probing.

This research is supported in part by the National Science Foundation under Contract No. CCF-0541206, grants from the New York State Office of Science, Technology & Academic Research to the Center for Advanced Technology in Electronic Imaging Systems, by grants from Intel Corporation, Eastman Kodak Company, and Freescale Semiconductor Corporation, and foundry support from MIT Lincoln Laboratories.

The pseudorandom number generators use linear feedback shift registers and XOR operations to generate a random 16-bit word every clock cycle once the generators are initialized [9]. The data flow in this circuit can be described as follows. After resetting the circuit, the PNGs are initialized and the control logic connects each input port to the appropriate output port. Since the control logic includes an eight-bit counter, each input port of the crossbar switch is successively connected every 256 clock cycles to each output port.
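The data flow described above can be sketched in software. The following is a minimal behavioral model, not the circuit's actual implementation: the feedback taps (16, 14, 13, 11), the seed, and the rotating port assignment in `crossbar_output` are assumptions chosen for illustration, since the text specifies only that linear feedback shift registers with XOR feedback produce a 16-bit word per cycle and that an eight-bit counter advances the crossbar connections every 256 cycles.

```python
def lfsr16_step(state: int) -> int:
    """Advance a 16-bit Fibonacci LFSR by one clock cycle.

    Taps 16, 14, 13, 11 (x^16 + x^14 + x^13 + x^11 + 1) are an assumed,
    commonly used maximal-length choice. The state must be nonzero.
    """
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def lfsr16_period(seed: int = 0xACE1) -> int:
    """Count the clock cycles until the register returns to the seed."""
    state = lfsr16_step(seed)
    count = 1
    while state != seed:
        state = lfsr16_step(state)
        count += 1
    return count


def crossbar_output(input_port: int, cycle: int,
                    ports: int = 6, counter_bits: int = 8) -> int:
    """Output port connected to input_port at a given clock cycle.

    Hypothetical schedule: each rollover of the eight-bit counter
    (every 256 cycles) rotates all connections by one output port.
    """
    epoch = cycle >> counter_bits  # number of completed 256-cycle intervals
    return (input_port + epoch) % ports


if __name__ == "__main__":
    print("LFSR period:", lfsr16_period())
    print("input 0 at cycles 0, 256, 512 ->",
          [crossbar_output(0, c) for c in (0, 256, 512)])
```

Under this assumed schedule, every input port visits all six output ports once every 6 × 256 = 1536 clock cycles.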



Fig. 1 Block diagram of the logic circuit included on each plane for each clock topology.

The current loads are implemented with cascode current mirrors, as shown in Fig. 2. In a cascode current mirror, the output current  $I_{out}$  follows  $I_{ref}$  more closely than in a simple current mirror. The reference current  $I_{ref}$  is externally provided to control the amount of current drawn from the circuit. The gate of transistor M5 is connected to the MSB of a four-bit counter, shown in Fig. 2 as the *sel* signal. This additional device is used to switch the current sinks. The widths of the devices shown in Fig. 2 are  $W_1 = W_2 = W_3 = W_4 = 600$  nm and  $W_5 = 2000$  nm.



Fig. 2 Cascoded current mirror with an additional control transistor.

Several metal-insulator-metal (MIM) capacitors are included in each circuit block to serve as extrinsic decoupling capacitance. Additionally, each of the circuit blocks is supplied by separate power and ground pads (three pairs of power and ground pads per block) to ensure that each block can be individually tested. Furthermore, one pair of power and ground pads is connected to the pad ring in order to provide protection from electrostatic discharge.

## B. 3-D Clock Topologies

Several clock network topologies for 3-D ICs are described in this section. These architectures combine different topologies which are commonplace in 2-D circuits, such as H-trees, rings, and meshes [2]. Each of the three blocks includes a different clock distribution structure; these structures are schematically illustrated in Fig. 3. The dashed lines depict vertical interconnects implemented by through silicon vias.



Fig. 3 Various 3-D clock distribution networks within the test circuit, (a) H-trees, (b) H-tree and local rings/meshes, (c) H-tree and global rings.

In each of the circuit blocks, the clock driver for the entire clock network is located on the second plane. The location of the clock driver is chosen to ensure that the clock signal propagates through identical vertical interconnect paths to the first and third planes, ideally resulting in the same delay when the clock signal reaches the first and third planes. The clock driver is implemented with a traditional chain of tapered buffers [11], [12]. Additionally, buffers are inserted at the leaves of each H-tree in all three topologies.
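The sizing of a tapered buffer chain can be sketched with the classical first-order model, in which each stage is larger than the previous one by a fixed taper factor. The function name `tapered_buffer_chain` and the example capacitances are hypothetical. The default taper factor of $e$ is the idealized optimum that neglects the intrinsic output capacitance of each stage; the factor used in the test circuit is not stated here.

```python
import math


def tapered_buffer_chain(c_in: float, c_load: float,
                         taper: float = math.e) -> list:
    """Relative sizes of the stages in a tapered buffer chain.

    c_in   : input capacitance of the first (unit-sized) stage
    c_load : capacitance driven by the last stage, in the same units
    taper  : target size ratio between successive stages (assumed)

    The number of stages is rounded to an integer, and the taper actually
    used is adjusted so that the last stage sees the same fan-out as all
    of the others.
    """
    ratio = c_load / c_in
    if ratio <= 1.0:
        return [1.0]  # a single unit-sized stage already suffices
    stages = max(1, round(math.log(ratio) / math.log(taper)))
    actual_taper = ratio ** (1.0 / stages)
    return [actual_taper ** i for i in range(stages)]


if __name__ == "__main__":
    # Hypothetical example: the clock network presents 64x the input
    # capacitance of the first stage, with a target taper factor of 4.
    print(tapered_buffer_chain(1.0, 64.0, taper=4.0))
```

A clock driver may also constrain the number of stages to be even so that the clock polarity is preserved; this sketch does not enforce that constraint.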

The architectures employed in the blocks are:

*Block A*: All of the planes contain a four-level H-tree (*i.e.*, equivalent to 16 leaves) with identical interconnect characteristics. All of the H-trees are connected through a group of TSVs at the output of the clock driver. Note that the H-tree on the second plane is rotated by  $90^{\circ}$  with respect to the H-trees on the other two planes. This rotation eliminates inductive coupling between the H-trees. All of the H-trees are shielded with two parallel lines connected to ground.

*Block B*: A four-level H-tree is included in the second plane. All of the leaves of this H-tree are connected by TSVs to small local rings on the first and third planes. As in Block A, the H-tree is shielded with two parallel lines connected to ground. Additional interconnect resources are used to form local meshes. Due to the limited interconnect resources, however, achieving a uniform mesh in each ring is difficult. Clock routing is constrained by the power and ground lines as only three metal layers are available on each plane [8].

*Block C*: The clock distribution network for the second plane is a shielded four-level H-tree. Two global rings are utilized for the other two planes. Buffers are inserted to drive each ring, which are connected by TSVs to the four branch points on the second level of the H-tree. The registers in each plane are connected either directly to the ring or to buffers at the leaves of the tree on the second plane.

# III. FABRICATION OF THE 3-D TEST CIRCUIT

The manufacturing process developed by MITLL for fully depleted silicon-on-insulator (FDSOI) 3-D circuits is summarized here [8]. The MITLL process is a wafer level 3-D integration technology with up to three FDSOI wafers bonded to form a 3-D circuit. The diameter of the wafers is 150 mm. The minimum feature size of the devices is 180 nm, with one polysilicon layer and three metal layers interconnecting the devices on each wafer. A backside metal layer also exists on the upper two planes, providing the pads for the TSVs and the I/O, power supply, and ground pads for the entire 3-D circuit. An attractive feature of this process is the high density TSVs. The dimensions of these vias are 1.75  $\mu$ m × 1.75  $\mu$ m, much smaller than the size of the through silicon via in many existing 3-D technologies [13], [14]. An intermediate step of the fabrication process is illustrated in Fig. 4, where some salient features of this technology are also depicted.

## IV. EXPERIMENTAL RESULTS

The clock distribution network topologies of the 3-D test circuit are evaluated in this section. The fabricated circuit is depicted in Fig. 5, where the different blocks can be distinguished. Each block includes four RF pads for measuring the delay of the clock signal. The pad located at the center of each block provides the input clock signal. The clock input is a sinusoidal signal with a DC offset, which is converted to a square waveform at the output of the clock driver. The remaining three RF pads are used to measure the delay of the clock signal at specific points on the clock distribution network within each plane. A buffer is connected at each of these measurement points, and the output of this buffer drives the gate of an open drain transistor connected to the RF pad.

A clock waveform acquired from the topology combining an H-tree and global rings, shown in Fig. 3c, is illustrated in Fig. 6, demonstrating operation of the circuit at 1.4 GHz. The clock skew between the planes of each block is listed in Table I. For the H-tree topology, the clock signal delay is measured from the root to a leaf of the tree on each plane, with no other load connected to these leaves. The skew between the leaves of the H-trees on planes A and C (*i.e.*, $t_{AC}$) is essentially the delay of a stacked TSV traversing between the three planes to transfer the clock signal from the target leaf to the RF pad on the third plane. The delay  $t_B$  is larger due to the additional capacitance coupled into that quadrant of the H-tree on the second plane. This capacitance is intentional on-chip decoupling capacitance placed under this quadrant, increasing the measured skews  $t_{BC}$  and  $t_{BA}$. The H-tree topology produces, on average, the lowest skew as compared to the other two topologies.



Fig. 4 Intermediate step of the MITLL process [8]. The second plane is flipped and bonded with the first plane. The backside metal layer and vias and the through silicon vias are also shown.



Fig. 5 Fabricated 3-D circuit. Some of the RF pads are also depicted.

Table I Measured clock skew among the planes of each block.

| Clock distribution network | Clock skew [ps]      |                      |                      |
|----------------------------|----------------------|----------------------|----------------------|
|                            | $t_{BA} = t_B - t_A$ | $t_{BC} = t_B - t_C$ | $t_{AC} = t_A - t_C$ |
| H-trees (Fig. 3a)          | 32.5                 | 28.3                 | -4.2                 |
| Local meshes (Fig. 3b)     | -68.4                | -18.5                | 49.8                 |
| Global rings (Fig. 3c)     | -112.0               | -130.6               | -18.6                |
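The entries of Table I can be cross-checked with a short script. Because the three columns are defined from the same three delays, $t_{AC} = t_{BC} - t_{BA}$ should hold within the rounding of the reported values. The mean absolute skew computed below is one possible reading of the "on average" comparison between topologies and is an assumption, not a metric defined in the measurements.

```python
# Measured skews from Table I in picoseconds: (t_BA, t_BC, t_AC).
SKEW_PS = {
    "H-trees": (32.5, 28.3, -4.2),
    "Local meshes": (-68.4, -18.5, 49.8),
    "Global rings": (-112.0, -130.6, -18.6),
}


def closure_error(t_ba: float, t_bc: float, t_ac: float) -> float:
    """Residual of t_AC = t_BC - t_BA, which follows from the definitions
    t_BA = t_B - t_A, t_BC = t_B - t_C, and t_AC = t_A - t_C."""
    return t_ac - (t_bc - t_ba)


def mean_abs_skew(skews) -> float:
    """Average magnitude of the three inter-plane skews (assumed metric)."""
    return sum(abs(s) for s in skews) / len(skews)


if __name__ == "__main__":
    for name, skews in SKEW_PS.items():
        print(f"{name}: closure error {closure_error(*skews):+.1f} ps, "
              f"mean |skew| {mean_abs_skew(skews):.1f} ps")
```

With this metric the H-tree block ranks lowest and the global rings block highest, consistent with the discussion of the measurements.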

The clock skew among the planes is greater for the local mesh topology as compared to the H-tree topology, primarily due to the unbalanced clock load for certain local meshes. The greatest difference in the load is between the measurement points on planes A and B, which also produces the largest skew for this topology. The increase in skew, however, is moderate. Additionally, the local meshes reduce the local skew for the load connected to each sink of the H-tree.

Alternatively, the clock distribution network that includes the global rings exhibits very low skew between planes A and C, the two planes that include the global rings. Although the clock load on each ring is non-uniformly distributed, the load balancing characteristic of the ring yields a relatively low skew between these planes. Since the clock distribution network on the second plane is implemented with an H-tree, the skews  $t_{BC}$  and  $t_{BA}$  are significantly larger than  $t_{AC}$. Note that the leaf of the H-tree, where the clock signal delay for the second plane is measured, is located at a great distance from the rings on planes A and C (see Fig. 3c). A combination of the H-tree and global rings, consequently, is not a suitable approach for 3-D circuits due to the large difference in distance that the clock signal traverses on each plane.



Fig. 6 Clock signal input and output waveform from the topology illustrated in Fig. 3c.

In Table II, the measured power consumption of the blocks operating at 1 GHz is reported. The local mesh topology demonstrates the lowest power consumption. This topology requires the least interconnect resources for the global clock network, since the local meshes are connected at the output of the buffers located on the last level of the H-tree on the second plane. In addition, this topology requires a moderate amount of local interconnect resources as compared to the H-tree and global ring topologies. Alternatively, the power consumed by the H-tree topology is the highest as this topology requires three H-trees and additional wiring for local connections to the leaves of each tree. Finally, the global rings block consumes slightly less power than the H-tree topology due to the reduced amount of wiring resources used by the global clock network.

# V. CONCLUSIONS

Three topologies to globally distribute a clock signal in 3-D circuits have been evaluated. A 3-D test circuit, based on the MITLL 3-D IC manufacturing process, has been designed, fabricated, and measured and is shown to operate at 1.4 GHz. Clock skew measurements indicate that a topology that combines the symmetry of an H-tree on the second plane and local meshes on the other two planes will result in low clock skew for 3-D circuits while consuming the lowest power as compared to the other investigated topologies.

Table II Measured power consumption of each block at 1 GHz.

| Clock distribution network | Power consumption [mW] |
|----------------------------|------------------------|
| H-trees (Fig. 3a)          | 260.3                  |
| Local meshes (Fig. 3b)     | 168.3                  |
| Global rings (Fig. 3c)     | 228.5                  |
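The relative differences in Table II can be quantified with a few lines of code. The sketch below uses the measured block power at 1 GHz; the helper names and the choice of the H-tree block as the baseline are illustrative assumptions. Note that the values are the power of each entire block, so the energy per cycle includes the logic and current loads as well as the clock network.

```python
# Measured block power from Table II in milliwatts, at a 1 GHz clock.
POWER_MW = {
    "H-trees": 260.3,
    "Local meshes": 168.3,
    "Global rings": 228.5,
}
CLOCK_HZ = 1e9


def saving_percent(name: str, baseline: str = "H-trees") -> float:
    """Power reduction of a block relative to the baseline block, in percent."""
    return 100.0 * (POWER_MW[baseline] - POWER_MW[name]) / POWER_MW[baseline]


def energy_per_cycle_pj(name: str) -> float:
    """Average energy drawn by the block per clock cycle, in picojoules."""
    return POWER_MW[name] * 1e-3 / CLOCK_HZ * 1e12


if __name__ == "__main__":
    for name in POWER_MW:
        print(f"{name}: {saving_percent(name):.1f}% below the H-tree block, "
              f"{energy_per_cycle_pj(name):.1f} pJ per cycle")
```

Relative to the H-tree block, the local mesh block consumes roughly a third less power, while the global rings block saves closer to an eighth.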

# ACKNOWLEDGMENTS

The authors would like to thank Yunliang Zhu, Lin Zhang, and Professor Hui Wu of the University of Rochester for their help during testing and the MIT Lincoln Laboratories for fabricating the 3-D circuit.

# REFERENCES

- [1] E. G. Friedman (Ed.), *Clock Distribution Networks in VLSI Circuits and Systems*, IEEE Press, New Jersey, 1995.
- [2] E. G. Friedman, "Clock Distribution Networks in Synchronous Digital Integrated Circuits," *Proceedings of the IEEE*, Vol. 89, No. 5, pp. 665-692, May 2001.
- [3] D. W. Bailey and B. J. Benschneider, "Clocking Design and Analysis for a 600-MHz Alpha Microprocessor," *IEEE Journal of Solid-State Circuits*, Vol. 33, No. 11, pp. 1627-1633, November 1998.
- [4] T. Xanthopoulos et al., "The Design and Analysis of the Clock Distribution Network for a 1.2 GHz Alpha Microprocessor," *Proceedings of the IEEE International Solid-State Circuits Conference*, pp. 402-402, February 2001.
- [5] J. Minz, X. Zhao, and S. K. Lim, "Buffered Clock Tree Synthesis for 3D ICs Under Thermal Variations," *Proceedings of the IEEE International Asia and South Pacific Design Automation Conference*, pp. 504-509, January 2008.
- [6] M. Mondal et al., "Thermally Robust Clocking Schemes for 3D Integrated Circuits," *Proceedings of the IEEE International Conference on Design, Automation and Test in Europe*, pp. 1206-1211, April 2007.
- [7] V. Arunachalam and W. Burleson, "Low-Power Clock Distribution in a Multilayer Core 3D Microprocessor," *Proceedings of the ACM/IEEE International Great Lakes Symposium on VLSI*, May 2008.
- [8] "FDSOI Design Guide," MIT Lincoln Laboratories, Cambridge, 2006.
- [9] W. Cui, H. Chen, and Y. Han, "VLSI Implementation of Universal Random Number Generator," *Proceedings of the IEEE Asia-Pacific Conference on Circuits and Systems*, Vol. 1, pp. 465-470, October 2002.
- [10] Available on-line: http://www.ece.ncsu.edu/erl/3DIC/pub
- [11] N. Hedenstierna and K. O. Jeppson, "CMOS Circuit Speed and Buffer Optimization," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, Vol. CAD-6, No. 2, pp. 270-281, March 1987.
- [12] B. S. Cherkauer and E. G. Friedman, "A Unified Design Methodology for CMOS Tapered Buffers," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, Vol. VLSI-3, No. 1, pp. 99-111, March 1995.
- [13] M. W. Newman et al., "Fabrication and Electrical Characterization of 3D Vertical Interconnects," *Proceedings of the IEEE International Electronic Components and Technology Conference*, pp. 394-398, June 2006.
- [14] P. Dixit and J. Miao, "Fabrication of High Aspect Ratio 35 μm Pitch Interconnects for Next Generation 3-D Wafer Level Packaging by Through-Wafer Copper Electroplating," *Proceedings of the IEEE International Electronic Components and Technology Conference*, pp. 388-393, June 2006.