# Minimizing Power Dissipation in Non-Zero Skew-based Clock Distribution Networks

José Luis Neves and Eby G. Friedman

Department of Electrical Engineering, University of Rochester, Rochester, New York 14627

#### ABSTRACT

A methodology is presented in this paper for synthesizing low power clock distribution networks. The clock distribution networks are designed with localized nonzero clock skew so as to improve circuit performance and reliability. Each branch of the clock tree is assigned a delay value that is emulated by one or more CMOS inverters, each designed such that the output load appears as being predominantly capacitive. A design technique is presented for selecting the size and number of inverters within each branch such that the total power dissipated within the clock distribution network is minimized. The power dissipation model considers both dynamic and short circuit power components. Simulation results exhibit reductions of up to 25% in total power dissipation within the clock distribution network, while accurately implementing the desired clock skews.

## 1. INTRODUCTION

In a modern VLSI circuit, a clock distribution network may drive thousands of registers, creating a large capacitive load that must be efficiently sourced. Furthermore, each transition of the clock signal changes the state of each capacitive node within the clock distribution network, in contrast with the switching activity in combinational logic blocks, where the change of logic state is dependent on the logic function. The combination of large capacitive loads and a continuous demand for higher clock frequencies has led to an increasingly larger proportion of the total power dissipated within the clock distribution network, in some applications much greater than 25% of the total power dissipated within a VLSI circuit [1].

The primary components of power dissipation in most digital circuits are dynamic and short-circuit power. It is possible to reduce $CV^2f$ dynamic power by lowering the clock frequency, the power supply, and/or the capacitive load of the clock distribution network. Lowering the clock frequency, however, conflicts with the primary goal of developing high speed VLSI circuits. Therefore, for a given circuit implementation, low dynamic power dissipation is best achieved by employing techniques that minimize the power supply and/or the capacitive load. Recently, De Man [2,3] introduced a technique for designing clock buffers and pipeline registers such that the clock distribution network operates at half the power supply swing, reducing the power consumption by 60% without compromising the clock frequency of the circuit.

This research is based upon work supported by Grant 200484/89.3 from CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico - Brasil) and the National Science Foundation under Grant No. MIP-9208165.

Short-circuit current occurs when the NMOS and PMOS transistors are being driven by an input signal with a finite slope, turning both transistors on at the same time. A technique for designing clock buffers is described in [4] that substantially reduces short-circuit current. This technique models the clock distribution network as a single lumped capacitor, and a reduction of more than 10% is reported for both power dissipation and transistor area.

In this paper a design methodology is presented for minimizing the total power dissipated in a clock tree by reducing the effective capacitance required to implement the clock tree, and by sharpening the input signals driving each inverter stage, thereby reducing the time when both transistors are on. In Section 2, a methodology for designing the circuits within the clock distribution network is reviewed, and the process for implementing the branch delays within a clock path with CMOS inverters is described. The design techniques developed for minimizing the power dissipated within each branch are presented in Section 3. Reductions in power dissipation while preserving the accuracy of the circuit implementation of the clock distribution network are demonstrated in Section 4. Finally, the primary results of this paper are summarized in Section 5.

## 2. CIRCUIT DESIGN OF THE CLOCK TREE

A four phase methodology for synthesizing clock distribution networks which exploits non-zero clock skew for improving performance and reliability has been developed [5,6]. In the first phase of this methodology, a set of non-zero clock skew values is determined for each local data path that maximizes the clock frequency while ensuring that no race conditions exist. In the second phase, the topology of the clock tree is derived from the hierarchical structure of the functional circuit, permitting delay values to be assigned to each branch of the clock tree such that the set of non-zero clock skews is satisfied. In the third phase, CMOS circuit structures are designed to emulate the delay of each branch of the clock tree. Finally, in the fourth phase, the physical layout of the clock distribution network is implemented based on the hierarchical and circuit information. A description of the first three phases of this design methodology can be found in [5,6].

In the circuit implementation of a clock distribution network, circuit structures are designed to emulate the delay values associated with each branch of the clock tree. Special attention is placed on guaranteeing that the clock skew between any two clock paths is satisfied rather than solely satisfying each individual clock path delay. Therefore, each clock path must be designed such that the delay of each branch is emulated as accurately as possible. This design process is described below.

### 2.1 Implementation of Clock Path Delay

A clock tree is composed of many clock signal paths (branches) each driving a storage element (leaf). Clock signal paths may have branches in common, depending on the topology of the clock tree. The total delay of a clock path  $t_{cpd}$  is the summation of the delays of each individual branch along the clock path  $\tau_{bi}$ ,

$$t_{cpd} = \sum_{i=1}^{n} \tau_{bi} . \tag{1}$$
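As a minimal sketch of (1), the delay of a clock path can be accumulated by walking from a leaf branch back to the root of the tree, and the skew between two paths follows as the difference of the two sums. The tree topology, the branch names, the delay values, and the sign convention of the skew (delay of the first path minus delay of the second) are assumptions made only for illustration.

```python
# Hypothetical clock tree: each branch maps to (parent branch, delay in ns).
# The topology and the delay values are made up for illustration.
BRANCHES = {
    "trunk": (None, 2.0),
    "b1": ("trunk", 3.0),
    "b2": ("trunk", 1.0),
    "b3": ("b2", 4.0),
}


def clock_path_delay(leaf_branch, branches=BRANCHES):
    """Sum the branch delays from a leaf branch back to the root, as in (1)."""
    total = 0.0
    branch = leaf_branch
    while branch is not None:
        parent, delay = branches[branch]
        total += delay
        branch = parent
    return total


def clock_skew(leaf_a, leaf_b, branches=BRANCHES):
    """Skew between two clock paths, taken here as delay(a) - delay(b)."""
    return clock_path_delay(leaf_a, branches) - clock_path_delay(leaf_b, branches)


print(clock_path_delay("b1"))   # trunk + b1
print(clock_skew("b1", "b3"))   # the shared trunk delay cancels
```

Because the trunk is common to both paths, its delay cancels in the skew, which is why the skew between two paths depends only on the branches the paths do not share.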

In order to accurately sum the individual delay components along a clock signal path, three issues are of importance: 1) isolation of each branch delay; 2) integration of the inverter and interconnect delay models used to calculate the delay of each clock path; and 3) integration of the waveform shape into the inverter delay model.

Due to the high input impedance of a CMOS inverter, the inverter effectively isolates each clock branch from the following branch. Furthermore, each branch can be implemented with one or more repeaters, such that the output impedance of each repeater is much greater than the resistance of the driven interconnect section [7]. Under this constraint, each inverter drives a load that can be modeled as a lumped capacitor composed of the capacitance of an interconnect line plus the input capacitance of each inverter connected to that interconnect line. As a consequence of the output impedance of the repeater being much greater than the resistance of the interconnect section, the slope of the input signal of an inverter connected to a branching point can be approximated by the slope of the output signal of the inverter driving that branching point [5]. Since the capacitive load includes the interconnect line capacitance and the input gate capacitance of each driven inverter, the transistor size and output load of each inverter are determined such that the branch delay and the total delay of a clock path are both satisfied.

### 2.2 Branch Delay Modeling

In the existing design methodology, the delay of a branch is implemented with one or more CMOS inverters, as illustrated in Figure 1. The delay equations of each inverter are based on the MOSFET  $\alpha$ -power law I-V model developed by Sakurai and Newton [8].

Each inverter is assumed to be driven by a ramp signal with symmetric rising and falling slopes, selected such that during discharge (charge), the effects of the PMOS (NMOS) transistor can be neglected [9]. The capacitive load of an inverter is given by (2),

$$C_{Li} = \frac{2I_{DO}}{V_{DD}} \left[ t_{di} - \left( \frac{1}{2} - \frac{1 - v_T}{1 + \alpha} \right) t_{Ti-1} \right] , \tag{2}$$

where $I_{DO}$ is the drain current at $V_{GS} = V_{DS} = V_{DD}$, $V_{DO}$ is the drain saturation voltage at $V_{GS} = V_{DD}$, $V_{th}$ is the threshold voltage, $\alpha$ is the velocity saturation index, $V_{DD}$ is the power supply, $t_{di}$ is the delay of an inverter defined as the 50% $V_{DD}$ point of the input waveform to the 50% $V_{DD}$ point of the output waveform, $v_T = V_{th}/V_{DD}$, and $t_{Ti-1}$ is the transition time of the input signal. Note that $C_{Li}$ is composed of the capacitance of the driven interconnect line and the total gate capacitance of all $b_{i+1}$ inverters.



Figure 1: Design of a branch delay element

Since $t_{di}$ is known, the only unknown in (2) is the transition time of the input signal, which is the transition time of the output signal of the preceding stage. The output transition time $t_{Ti}$ is given by (3), where the output waveform is approximated by a ramp that linearly connects the 0.1 $V_{DD}$ and 0.9 $V_{DD}$ points. This approximation is accurate as long as the interconnect resistance is negligible compared with the inverter output impedance.

$$t_{Ti} = \frac{t_{0.9} - t_{0.1}}{0.8} = \frac{C_{Li} V_{DD}}{I_{DO}} \left( \frac{0.9}{0.8} + \frac{V_{DO}}{0.8 V_{DD}} \ln \frac{10 V_{DO}}{e V_{DD}} \right) \tag{3}$$

For each clock path within the clock tree, the procedure to design the CMOS inverters is as follows: 1) the load of the initial trunk of the clock tree is determined from (2), assuming a step input clock signal; 2) the slope of the output signal is calculated from (3); 3) this value is applied in (2) to determine the load of the following branch, and (3) is used again to determine the slope of the output signal; and 4) step 3 is repeated for each subsequent branch of the clock path. Steps 1-4 are applied to the remaining clock paths within the clock tree. Observe that if the transition time of the output signal of branch  $b_i$  does not satisfy

$$t_{Ti} \le \frac{1}{\left(\frac{1}{2} - \frac{1 - v_{T}}{1 + \alpha}\right)} \left(t_{di+1} - \frac{V_{DD}C_{Li+1}}{2I_{DO}}\right) , \tag{4}$$

(2) is no longer valid. The transition time $t_{Ti}$ can be reduced in order to satisfy (4) by increasing the output current drive of the inverter in branch $b_i$. However, increasing $I_{DOi}$ would increase the capacitive load $C_{Li}$ in order to maintain the propagation delay $t_{di}$ for branch $b_i$. Therefore, the transition time associated with branch $b_i$ must be maintained constant as long as the propagation delay $t_{di}$ of the branch $b_i$ remains the same. Furthermore, the number of inverters required to implement the propagation delay $t_{di}$ is chosen such that (4) is satisfied and the polarity of the clock signal driving branch $b_{i+1}$ does not change. For comparison purposes, this design approach is called method # 1.
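The four design steps for a single clock path can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the device parameters are hypothetical values (not taken from the paper), every branch is implemented with one inverter of identical drive, and (4) is read as a bound in which $C_{Li+1}$ is replaced by a hypothetical minimum realizable load `c_min`.

```python
import math

# Hypothetical device parameters (not taken from the paper).
VDD = 5.0       # power supply, V
VTH = 1.0       # threshold voltage, V
ALPHA = 1.3     # velocity saturation index
IDO = 0.25e-3   # drain current at VGS = VDS = VDD, A
VDO = 2.5       # drain saturation voltage at VGS = VDD, V

V_T = VTH / VDD
K = 0.5 - (1.0 - V_T) / (1.0 + ALPHA)  # coefficient of the input transition time in (2)


def load_capacitance(t_d, t_t_in):
    """Equation (2): load that yields delay t_d for an input transition time t_t_in."""
    return (2.0 * IDO / VDD) * (t_d - K * t_t_in)


def output_transition_time(c_load):
    """Equation (3): transition time of the output signal driving c_load."""
    shape = 0.9 / 0.8 + (VDO / (0.8 * VDD)) * math.log(10.0 * VDO / (math.e * VDD))
    return (c_load * VDD / IDO) * shape


def design_clock_path(branch_delays, c_min=20e-15):
    """Steps 1-4 for one clock path; returns a (C_L, t_T) pair per branch.

    c_min is an assumed minimum realizable load. Requiring C_L >= c_min is
    (4) evaluated with C_L(i+1) set to c_min.
    """
    results = []
    t_t_in = 0.0  # step 1: step input at the trunk of the clock tree
    for t_d in branch_delays:
        c_load = load_capacitance(t_d, t_t_in)
        if c_load < c_min:
            raise ValueError("input transition too slow for the requested branch delay")
        t_t_out = output_transition_time(c_load)  # step 2
        results.append((c_load, t_t_out))
        t_t_in = t_t_out  # step 3: this slope drives the following branch
    return results


for c_load, t_t in design_clock_path([2e-9, 2e-9, 3e-9]):
    print(f"C_L = {c_load * 1e15:.1f} fF, t_T = {t_t * 1e9:.2f} ns")
```

With these parameters, a branch delay that is short relative to the preceding transition time fails the check, which is the situation in which the branch would have to be redesigned.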

## 3. LOW POWER CLOCK TREE DESIGN

The total power dissipation of a clock distribution network  $P_{Total}$  can be decomposed into two components,

$$P_{Total} = P_{dy} + P_{sc} , \tag{5}$$

where  $P_{dy}$  is the dynamic switching power dissipation due to the load capacitance  $C_{Li}$  of each inverter within a clock path, and  $P_{sc}$  is the power dissipated due to the short-circuit current within each inverter. Both power components can be reduced by minimizing the load capacitance of the branches of a clock path, by implementing each branch with more than one inverter while maintaining the branch delay constant. Therefore, the total branch capacitance in (2) may be reduced by increasing the number of inverters within the branch. This process is illustrated in Figure 2, where the total branch capacitance  $C_B$  is shown as a function of the number of inverter stages and branch delays,  $t_{di}$ .



Figure 2:  $C_B$  versus number of stages for selected  $t_{di}$  values

As the number of inverter stages increases for a given delay, the total load capacitance of the chain of inverters decreases asymptotically. The total branch capacitance  $C_B$  is limited by the maximum number of inverter stages required to accurately implement a given branch delay  $t_{di}$ .
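The trend described above can be reproduced qualitatively with (2) and (3). The sketch below assumes hypothetical device parameters (not the 1.2 µm process of Section 4), identical inverters in every stage, a branch delay divided equally among the stages, and a step input at the first stage. None of these assumptions is required by the methodology; they only keep the example short.

```python
import math

# Hypothetical device parameters (not from the paper); every stage uses the same inverter.
VDD, VTH, ALPHA = 5.0, 1.0, 1.3
IDO, VDO = 0.25e-3, 2.5

K = 0.5 - (1.0 - VTH / VDD) / (1.0 + ALPHA)
SHAPE = 0.9 / 0.8 + (VDO / (0.8 * VDD)) * math.log(10.0 * VDO / (math.e * VDD))


def branch_capacitance(t_branch, n_stages):
    """Total load C_B of n_stages inverters that split t_branch equally."""
    t_stage = t_branch / n_stages
    t_t_in = 0.0  # assumed step input to the first stage
    total = 0.0
    for _ in range(n_stages):
        c_load = (2.0 * IDO / VDD) * (t_stage - K * t_t_in)  # equation (2)
        t_t_in = (c_load * VDD / IDO) * SHAPE                # equation (3)
        total += c_load
    return total


for n in range(1, 7):
    print(n, f"{branch_capacitance(4e-9, n) * 1e15:.1f} fF")
```

Under these assumptions the largest drop in $C_B$ occurs between one and two stages, and additional stages yield progressively smaller reductions.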

By reducing the total branch capacitance, both dynamic and short-circuit power dissipation are reduced. Dynamic power dissipation is reduced since the total capacitance on the branch is less (more inverters driving smaller capacitive loads rather than a single large inverter driving a large capacitive load). The short-circuit power component,  $P_{sc}$ , shown in (6) [8], is reduced since the transition time  $t_{Ti}$ , which is linearly dependent on the inter-stage load capacitance, is reduced.

$$P_{sc} = V_{DD} f t_{Ti} I_{DO} \frac{1}{(\alpha + 1)} \frac{1}{2^{\alpha - 1}} \frac{(1 - 2v_T)^{\alpha + 1}}{(1 - v_T)^{\alpha}} \tag{6}$$
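The two components of (5) can be evaluated per inverter with a few lines of code. The sketch below uses the $CV^2f$ expression for the dynamic component and (6) for the short-circuit component. The function names and the numerical values in the example are hypothetical.

```python
def dynamic_power(c_load, vdd, freq):
    """C V^2 f dynamic power of one switched load, in watts."""
    return c_load * vdd ** 2 * freq


def short_circuit_power(t_t, vdd, freq, ido, alpha, v_t):
    """Equation (6). Assumes v_t <= 0.5, so that (1 - 2 v_t) is not negative."""
    return (
        vdd * freq * t_t * ido
        * (1.0 / (alpha + 1.0))
        * (1.0 / 2.0 ** (alpha - 1.0))
        * (1.0 - 2.0 * v_t) ** (alpha + 1.0)
        / (1.0 - v_t) ** alpha
    )


# Hypothetical operating point: 100 fF load, 5 V supply, 100 MHz clock.
p_dy = dynamic_power(100e-15, 5.0, 100e6)
p_sc = short_circuit_power(1e-9, 5.0, 100e6, 0.25e-3, 1.3, 0.2)
print(f"P_dy = {p_dy * 1e6:.1f} uW, P_sc = {p_sc * 1e6:.2f} uW")
```

Since (6) is linear in $t_{Ti}$, halving the transition time at the input of an inverter halves its short-circuit power for fixed device parameters.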

A clock path within a clock distribution network typically shares branches with other clock paths. Therefore, the total clock path delay may not be equally divided among the inverters implementing the clock path. Furthermore, the insertion of extra inverters within a particular branch in order to reduce power dissipation should not change the signal polarity of the clock path. As a consequence, a design strategy, called method # 2, is proposed in which the number of inverters per branch is chosen such that the total power dissipation is reduced while preserving the polarity of the clock signal for each clock path within the clock tree.

Note in Figure 2 that the most significant gain in lowering the branch capacitance  $C_B$  occurs when a branch delay is implemented with two or three inverter stages. In the algorithm developed for implementing the design method # 2, a search is performed for all the clock signal paths with common branches, identifying which branches should be implemented with two or three inverter stages, such that the polarity of each clock signal path is preserved. Therefore, in design method # 2, each branch is implemented with either two or three inverter stages. However, the accuracy of implementing a branch delay value depends on the fabrication process, the dimensions of the transistors in the inverter, and the interconnect capacitance per branch. For the 1.2 µm CMOS process used to characterize the clock distribution networks presented in Section 4, the minimum propagation delay of an inverter stage is 1 ns, corresponding to an inverter driving a capacitive load of approximately 100 fF. Therefore, branches with delay values of at least 2 ns are candidates for method # 2, as illustrated in the clock path represented in Figure 3, where a savings of 26% in power dissipation is obtained by applying method #2.



Figure 3: Reduced power dissipation with added inverters
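One simple way to preserve polarity is to keep the parity of the inverter count of every branch unchanged, which is sufficient (although not necessary) for every clock path to keep its polarity. The sketch below follows this conservative rule and is not the search algorithm developed for method # 2. The tree, the baseline stage counts, and the requirement that a branch with $n$ stages has a delay of at least $n$ times the minimum stage delay are assumptions.

```python
MIN_STAGE_DELAY_NS = 1.0  # minimum inverter stage delay quoted above for the 1.2 um process


def choose_stage_count(branch_delay_ns, baseline_stages):
    """Pick two or three stages with the same parity as the baseline, if the delay allows."""
    target = 2 if baseline_stages % 2 == 0 else 3
    if baseline_stages >= target:
        return baseline_stages  # nothing to gain under this simple rule
    if branch_delay_ns >= target * MIN_STAGE_DELAY_NS:
        return target
    return baseline_stages      # branch delay too short to split further


def path_parity(path, stages):
    """Parity of the number of inverters along a clock path (1 means inverted)."""
    return sum(stages[branch] for branch in path) % 2


# Hypothetical clock tree: branch delays (ns), baseline stage counts, and paths.
delays = {"trunk": 4.0, "b1": 3.0, "b2": 2.0, "b3": 1.5}
baseline = {"trunk": 1, "b1": 1, "b2": 2, "b3": 1}
paths = {"R1": ["trunk", "b1"], "R2": ["trunk", "b2", "b3"]}

assigned = {b: choose_stage_count(delays[b], baseline[b]) for b in delays}
print(assigned)
for name, path in paths.items():
    print(name, path_parity(path, baseline) == path_parity(path, assigned))
```

A search over paths with common branches, as described above, can accept parity changes in individual branches as long as they cancel along every path, and can therefore convert more branches than this per-branch rule.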

## 4. EXPERIMENTAL RESULTS

In Table 1, the power dissipation of methods # 1 and # 2 is compared for several arbitrary clock distribution networks. The second column lists the number of clock paths, each of which drives a storage element. The third and fourth columns list the total power dissipation, both dynamic and short-circuit, in a clock distribution network where the circuitry of the clock tree is implemented with design methods # 1 and # 2, respectively. The final column lists the savings in power dissipation of method # 2 relative to method # 1. Note that the power dissipation is reduced in each clock distribution network by redesigning the clock branches with multiple inverters.

Table 1: Total power dissipation for methods # 1 and # 2

| Circuit | # clock paths | $P_{dy} + P_{sc}$ (mW), method # 1 | $P_{dy} + P_{sc}$ (mW), method # 2 | Savings (%) |
|---------|---------------|------------------------------------|------------------------------------|-------------|
| cdn 1   | 20      | 9.3                    | 8.3        | 10.8    |
| cdn 2   | 38      | 16.8                   | 14.6       | 13.1    |
| cdn 3   | 52      | 20.2                   | 17.3       | 14.4    |
| cdn 4   | 45      | 23.3                   | 17.6       | 24.5    |
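The savings column can be recomputed from the two power columns, taking the savings as the reduction relative to method # 1. The short check below copies the values from Table 1; the function name is hypothetical.

```python
# (circuit, method # 1 power in mW, method # 2 power in mW), copied from Table 1
TABLE_1 = [
    ("cdn 1", 9.3, 8.3),
    ("cdn 2", 16.8, 14.6),
    ("cdn 3", 20.2, 17.3),
    ("cdn 4", 23.3, 17.6),
]


def savings_percent(p_method1, p_method2):
    """Relative reduction in total power with respect to method # 1."""
    return 100.0 * (p_method1 - p_method2) / p_method1


for name, p1, p2 in TABLE_1:
    print(f"{name}: {savings_percent(p1, p2):.1f} %")
```

Rounded to one decimal place, the computed values agree with the savings listed in the table.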

As described in Section 2, the objective of the circuit design phase is to synthesize the circuit structures within the clock tree such that the clock skew between any two clock paths is satisfied. In Table 2, the clock skew between registers for the first clock distribution network example in Table 1 is described. In column two, the scheduled clock skew determined from the first phase of the design methodology [6] for the pair of registers presented in column one is shown. The clock skew values obtained from SPICE circuit simulation, where the clock distribution network is implemented with design methods # 1 and # 2, respectively, are listed in columns three and four. Finally, the percent errors between the specified and measured values for both design methods are provided in columns five and six.

Table 2: Accuracy of clock skew values for cdn 1

| Registers         | Specified skew (ns) | Measured (ns), method # 1 | Measured (ns), method # 2 | Error (%), method # 1 | Error (%), method # 2 |
|-------------------|---------------------|---------------------------|---------------------------|-----------------------|-----------------------|
| $R_1 - R_3$       | 0.0                 | 0.0                       | 0.0                       | 0.0                   | 0.0                   |
| $R_{15} - R_{16}$ | 3.0                 | 3.07                      | 3.12                      | 2.3                   | 4.0                   |
| $R_{12} - R_{13}$ | -2.0                | -2.19                     | -2.02                     | 9.5                   | 1.0                   |
| $R_{17} - R_{19}$ | 2.0                 | 2.17                      | 2.05                      | 8.5                   | 2.5                   |
| $R_4 - R_{14}$    | 0.0                 | 0.21                      | 0.10                      | -                     | -                     |
| $R_6 - R_{12}$    | -2.0                | -1.83                     | -2.05                     | 8.5                   | 2.5                   |
| $R_5 - R_{13}$    | -4.0                | -4.03                     | -4.06                     | 0.75                  | 1.5                   |
From Table 2 it is observed that by designing a clock distribution network with the design method # 2, the accuracy of the clock skew values is not degraded. In fact, in certain cases the accuracy is improved, since the delay model is more accurate with faster rise and fall signal transition times. The output waveforms, derived from SPICE, are illustrated in Figure 4 for the circuit presented in Figure 3. Thus, a more power and speed efficient design of a clock signal path would utilize multiple smaller inverters within each branch of the clock tree.

## 5. CONCLUSIONS

VLSI/ULSI-based synchronous systems require the efficient synthesis of clock distribution networks in order to obtain higher clock frequencies, eliminate race conditions, and reduce power dissipation. In this paper, a new circuit design technique is presented for substantially reducing the dynamic and short circuit power dissipation of clock distribution networks.

The proposed technique for minimizing power dissipation implements the delay of each branch within the clock tree with multiple smaller inverters, reducing the effective capacitive load of the clock distribution network and the rise and fall times of the clock signals within the clock tree. Significant savings in power dissipation of up to 25% are reported, while the accuracy of implementing the desired non-zero clock skew values is preserved.

Figure 4: SPICE output of the circuit in Figure 3

Thus, an effective design technique for reducing the total power dissipated in clock distribution networks is presented. This technique is currently being integrated into a synthesis methodology for designing high performance clock distribution networks for application to VLSI/ULSI-based synchronous digital systems.

## REFERENCES

- [1] D. W. Dobberpuhl, et al., "A 200-MHz 64-b Dual-Issue CMOS Microprocessor," *IEEE Journal of Solid State Circuits*, Vol. SC-27, No. 11, pp. 1555-1565, November 1992.
- [2] T. G. Noll and E. De Man, "Pushing the Performance Limits due to Power Dissipation of Future ULSI Chips," Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1652-1655, May 1992.
- [3] E. De Man and M. Schöbinger, "Power Dissipation in the Clock System of highly pipelined ULSI CMOS Circuits," Proceedings of the International Workshop on Low Power Design, pp. 133-138, April 1994.
- [4] K.-Y. Khoo and A. N. Willson, Jr., "Low Power CMOS Clock Buffer," Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 355-358, May 1994.
- [5] J. L. Neves and E. G. Friedman, "Circuit Synthesis of Clock Distribution Networks based on Non-Zero Clock Skew," Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 4.175-4.178, May 1994.
- [6] J. L. Neves and E. G. Friedman, "Synthesizing Distributed Buffer Clock Trees for High Performance ASICs," Proceedings of the IEEE ASIC Conference, pp. 126-129, September 1994.
- [7] S. Dhar and M. A. Franklin, "Optimum Buffer Circuits for Driving Long Uniform Lines," *IEEE Journal of Solid-State Circuits*, Vol. SC-26, No. 1, pp. 32-40, January 1991.
- [8] T. Sakurai and A. R. Newton, "Alpha-Power Law MOSFET Model and its Applications to CMOS Inverter Delay and Other Formulas," *IEEE Journal of Solid State Circuits*, Vol. SC-25, No. 2, pp. 584-594, April 1990.
- [9] N. Hedenstierna and K. O. Jeppson, "CMOS Circuit Speed and Buffer Optimization," *IEEE Transactions on Computer-Aided Design*, Vol. CAD-6, No. 2, pp. 270-281, March 1987.