# Synthesizing Distributed Buffer Clock Trees for High Performance ASICs

José Luis Neves and Eby G. Friedman

Department of Electrical Engineering, University of Rochester, Rochester, NY 14627

Abstract - An integrated design system is presented in this paper for synthesizing high performance clock distribution networks for application to high speed ASICs. An optimal clock skew schedule is determined which provides a set of localized non-zero clock skew values that improve both circuit performance and reliability. These clock skew values together with the functional hierarchy are used to design the topology of the clock distribution network and to determine the minimum clock path delays which satisfy the clock skew schedule. Distributed buffers targeted for CMOS technology are synthesized to emulate the delay values assigned to the individual branches of the clock tree. Maximum errors of less than 2.5% for the delay of the clock paths and 4% for the clock skew between any two registers belonging to the same global data path as compared with SPICE Level-3 are demonstrated for an example circuit.

## 1. INTRODUCTION

Most existing Application-Specific Integrated Circuits (ASICs) utilize fully synchronous timing, requiring a reference signal to control the temporal sequence of operations. Globally distributed signals, such as clock signals, are used to provide this synchronous time reference. These signals can dominate and limit the performance of VLSI-based ASICs. This is, in part, due to the continuing reduction of feature size concurrent with increasing chip dimensions. Thus, interconnect delay has become increasingly significant, perhaps of greater importance than active device delay. Furthermore, the design of the clock distribution network, particularly in high speed applications, requires a significant amount of design time, which is inconsistent with the rapid design turnaround of the more common data flow synthesis phase of ASIC design methodologies.

Several techniques have been developed to improve the performance and design efficiency of clock distribution networks, such as buffer insertion [1] to reduce the propagation delay and power consumption of clock distribution networks, symmetric distribution networks [2], such as H-tree structures, and zero-skew clock routing algorithms [3,4] to automatically lay out clock nets. A common disadvantage of these approaches is that the clock distribution network is designed so as to ensure minimal clock skew between each register, not recognizing that localized non-zero clock skew [5,6] can be used to improve circuit performance and minimize the likelihood of any race conditions.

A novel methodology is therefore presented in this paper for efficiently synthesizing distributed buffer, tree-structured clock distribution networks. This methodology is illustrated as part of the IC design process cycle in Figure 1. The IC design cycle typically begins with the System Specification phase. The Clock Tree Design Cycle utilizes timing information from the Logic Design phase, such as the minimum and maximum delay values of the logic blocks and the registers. Furthermore, the hierarchical description of the circuit netlist is used to form the topology of the clock tree. The output of the Clock Tree Design Cycle is a detailed circuit description of the clock distribution network. Design information, such as the branch location and the geometric width and length of the transistors within each buffer along the clock path, is determined.

Figure 1: Block diagram of the IC design cycle

The clock distribution design methodology is composed of four major phases. In the first phase, summarized in Section 2, a localized clock skew schedule is determined which maximizes circuit performance and reliability.
In the second phase, described in Section 3, a topological design of the clock distribution network is obtained, producing a clock tree with minimum delay values assigned to each branch. In the third phase, presented in Section 4, the design of the circuit structures which implement the individual branch delay values is described. The final phase is the geometric layout of the clock distribution network. The layout phase is not addressed in this paper since it is directly dependent upon the physical floorplan of the functional circuit and existing clock distribution network layout strategies have been extensively described in the literature [e.g., 3,4]. Experimental results demonstrating the accuracy of this methodology are presented in Section 5 and some conclusions are drawn in Section 6.

This research is based upon work supported by Grant 200484/89.3 from CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico - Brasil) and the National Science Foundation under Grant No. MIP-9208165.

## 2. OPTIMAL CLOCK SKEW SCHEDULING

Assuming that the minimum and maximum delay characteristics of the combinational logic blocks and the registers are known, it is possible to determine the optimal clock skew of each local data path such that the clock frequency is maximized while ensuring that no race conditions exist [5,6,7,8]. This is accomplished by formulating the optimal clock scheduling problem as a linear programming problem and solving the set of linear inequalities with standard linear programming techniques [9]. As described below, the set of inequalities is derived from the timing relationships of each local data path in the circuit.
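
As a rough illustration of how such a schedule might be computed, the sketch below searches for the smallest clock period at which the hold-type bound (1), the clock period bound (3), and the I/O condition (5) derived in the remainder of this section can all be met. It is not the linear programming procedure of [9]: because every constraint involves only the difference of two clock path delays, the sketch swaps in bisection on $T_{CP}$ with a Bellman-Ford feasibility check. The function names, the `LocalPath` tuple, and the three-register example values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (R_i, R_j, T_Hold_j, T_PD(min), T_PD(max)) for one local data path, in ns.
LocalPath = Tuple[str, str, float, float, float]


def clock_delays(paths: List[LocalPath], io_pairs: List[Tuple[str, str]],
                 t_cp: float) -> Optional[Dict[str, float]]:
    """Clock path delays C meeting (1), (3), and (5) at period t_cp, else None."""
    edges = []  # each (u, v, w) encodes the difference constraint C[v] - C[u] <= w
    for ri, rj, hold, pd_min, pd_max in paths:
        edges.append((ri, rj, pd_min - hold))   # (1): C_i - C_j >= hold - pd_min
        edges.append((rj, ri, t_cp - pd_max))   # (3): C_i - C_j <= t_cp - pd_max
    for r_in, r_out in io_pairs:                # (5): C_in - C_out = 0
        edges.append((r_in, r_out, 0.0))
        edges.append((r_out, r_in, 0.0))
    nodes = {n for u, v, _ in edges for n in (u, v)}
    dist = {n: 0.0 for n in nodes}              # implicit source at distance 0
    for _ in range(len(nodes)):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:                         # converged: constraints are met
            base = min(dist.values())
            return {n: d - base for n, d in dist.items()}
    return None  # still relaxing after |V| passes: negative cycle, infeasible


def min_clock_period(paths: List[LocalPath], io_pairs: List[Tuple[str, str]],
                     tol: float = 1e-6) -> float:
    """Smallest feasible T_CP (within tol), found by bisection."""
    hi = max(p[4] for p in paths)
    for _ in range(64):                         # widen until some period works
        if clock_delays(paths, io_pairs, hi) is not None:
            break
        hi *= 2.0
    else:
        raise ValueError("constraints (1) and (5) cannot be met at any period")
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if clock_delays(paths, io_pairs, mid) is None:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical pipeline R1 -> R2 -> R3 with R1 and R3 at the circuit I/Os.
PATHS = [("R1", "R2", 1.0, 6.0, 8.0), ("R2", "R3", 1.0, 3.0, 4.0)]
IO_PAIRS = [("R1", "R3")]

if __name__ == "__main__":
    t_cp = min_clock_period(PATHS, IO_PAIRS)
    print(t_cp, clock_delays(PATHS, IO_PAIRS, t_cp))
```

For this hypothetical pipeline the search settles near 6 ns, compared with the 8 ns that zero skew would require, by clocking $R_2$ about 2 ns later than $R_1$ and $R_3$.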

The timing relationships for any local data path composed of two sequentially adjacent registers, $R_i$ and $R_j$, are

$$T_{Skewij} \ge T_{Holdj} - T_{PD(min)} \tag{1}$$

$$T_{PD(min)} = T_{C-Qi} + T_{Logic(min)} + T_{Int} + T_{Set-upj} \tag{2}$$

where $T_{Skewij}$ is the difference between the clock path delays $C_i$ and $C_j$, $T_{Holdj}$ is the amount of time the input data must be stable after the clock signal changes state, $T_{PD(min)}$ ($T_{PD(max)}$) is the minimum (maximum) propagation delay between registers $R_i$ and $R_j$, $T_{C-Qi}$ is the time required for the data to leave $R_i$ once it is triggered by the clock pulse $C_i$, $T_{Logic(min)}$ is the minimum propagation delay through the logic block between the registers $R_i$ and $R_j$, $T_{Int}$ is the interconnect delay between the registers $R_i$ and $R_j$, and $T_{Set-upj}$ is the time required to successfully propagate to and latch the data within $R_j$. Observe that (1) prevents the latching of the incorrect data signal into $R_j$ by the clock pulse that latched the same data into $R_i$ (i.e., creating a race condition). Furthermore,

$$T_{CP} \ge T_{Skewij} + T_{PD(max)} \tag{3}$$

$$T_{PD(max)} = T_{C-Qi} + T_{Logic(max)} + T_{Int} + T_{Set-upj} \tag{4}$$

where $T_{CP}$ is the minimum clock period and $T_{Logic(max)}$ is the maximum propagation delay through the logic block between registers $R_i$ and $R_j$. Equation (3) guarantees that the data signal latched in $R_i$ is latched into $R_j$ before the next clock pulse arrives at $R_j$.

In order to eliminate race conditions between different VLSI circuits controlled by the same clock source, the non-zero clock skew must not be propagated beyond the circuit I/Os. Therefore, the off-chip clock skew is constrained to satisfy the following relationship,

$$T_{Skew\,in,out} = T_{Skew\,in,1} + \cdots + T_{Skew\,n,out} = 0 \tag{5}$$

Equations (1) and (3) for each local data path and (5) for each global data path are sufficient conditions to determine the optimal clock skew schedule such that the overall circuit performance is maximized while eliminating any race conditions.

## 3. TOPOLOGICAL DESIGN

The topological synthesis phase of the clock distribution network design methodology is composed of three steps. In the first step, the minimum clock path delay of each register in the circuit is determined from the specified clock skew schedule. In the next step, the topology of the clock distribution network is determined from the hierarchical description of the circuit, creating a clock tree where the branching points correspond to each hierarchical level. In the final step, delay values are attached to each branch of the clock distribution network, satisfying the initial clock skew assignment. A detailed description of each step can be found in [10].

To illustrate this synthesis method, the topology of a clock distribution network of an example circuit composed of twenty registers is depicted in Figure 2. The numbers in parentheses are the original clock skew specifications derived from the optimal clock scheduling phase, while the numbers in brackets are the minimum delay values assigned to each branch.

Figure 2: Clock delay and skew assignment for an example clock distribution network

## 4. DESIGN OF CIRCUIT DELAY ELEMENTS

The circuit structures are designed to emulate the delay values associated with each branch of the clock tree. Special attention is placed on guaranteeing that the clock skew between any two clock paths is satisfied rather than satisfying each individual clock path delay.
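
A small sketch may help show why the skew, rather than each absolute path delay, is the quantity to protect. Assuming a tree in which every branch carries one delay value, the delay of a clock path is the sum of its branch delays, so branches shared by two paths cancel in $T_{Skewij} = C_i - C_j$. The tree, its node names, and its delay values below are hypothetical and are not taken from Figure 2.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical clock tree: node -> (parent, delay in ns of the branch feeding it).
Tree = Dict[str, Tuple[Optional[str], float]]

TREE: Tree = {
    "clk": (None, 0.0),   # root: the clock source
    "A": ("clk", 2.0),    # branching points, one per hierarchical block
    "B": ("clk", 3.0),
    "R1": ("A", 5.0),     # registers are the leaves
    "R2": ("A", 3.0),
    "R3": ("B", 4.0),
}


def branches(tree: Tree, register: str) -> List[str]:
    """Nodes whose feeding branch lies on the path from the root to register."""
    path: List[str] = []
    node: Optional[str] = register
    while node is not None:
        path.append(node)
        node = tree[node][0]
    return path


def path_delay(tree: Tree, register: str) -> float:
    """Clock path delay C_i: the sum of all branch delays down to the register."""
    return sum(tree[n][1] for n in branches(tree, register))


def skew(tree: Tree, ri: str, rj: str) -> float:
    """T_Skew_ij = C_i - C_j, computed from the non-shared branches only."""
    bi, bj = set(branches(tree, ri)), set(branches(tree, rj))
    return sum(tree[n][1] for n in bi - bj) - sum(tree[n][1] for n in bj - bi)


# Perturb the branch shared by the R1 and R2 clock paths by 0.3 ns.
PERTURBED: Tree = dict(TREE)
PERTURBED["A"] = ("clk", 2.3)

if __name__ == "__main__":
    for name, tree in (("nominal", TREE), ("perturbed", PERTURBED)):
        print(name, path_delay(tree, "R1"),
              skew(tree, "R1", "R2"), skew(tree, "R1", "R3"))
```

In this hypothetical tree, perturbing the shared branch A changes the delay of the R1 path but leaves the R1 to R2 skew untouched, while the R1 to R3 skew, whose paths share only the root, shifts by the full perturbation.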

The successful design of each clock path is primarily dependent on two factors: 1) isolation of each branch delay using active elements, specifically CMOS inverters, and 2) the use of repeaters to permit the integration of inverter and interconnect delay equations so as to accurately calculate the delay of each clock path. The interconnect lines are modeled as purely capacitive lines by inserting the inverting buffers into the clock path such that the output impedance of each inverter is significantly greater than the resistance of the driven interconnect line [11]. As a consequence, the slope of the input signal of a buffer connected to a branching point is identical to the slope of the output signal of the buffer driving that branching point [12].

The delay equations describing the inverting repeater are used to determine the geometric dimensions of the transistors and are based on the MOSFET $\alpha$-power law I-V model developed by Sakurai and Newton [13]. The time $t_d$ from the 50% $V_{DD}$ point of the input waveform to the 50% $V_{DD}$ point of the output waveform is defined as the delay of the circuit element. Equation (6), derived from [13], describes the delay of a CMOS inverter in terms of its load capacitance $C_L$,

$$C_L = \frac{2I_{DO}}{V_{DD}} \left[ t_d - \left( \frac{1}{2} - \frac{1 - v_T}{1 + \alpha} \right) s_r \right], \quad v_T = \frac{V_{th}}{V_{DD}} \tag{6}$$

where $I_{DO}$ is the drain current at $V_{GS} = V_{DS} = V_{DD}$, $V_{DO}$ is the drain saturation voltage at $V_{GS} = V_{DD}$, $V_{th}$ is the threshold voltage, $\alpha$ is the velocity saturation index, and $V_{DD}$ is the power supply voltage.

The output waveform of the driving inverter is the input signal to all the branches connected to this inverter and is approximated by a ramp-shaped waveform.
This approximation is achieved by linearly connecting the points 0.1 $V_{DD}$ and 0.9 $V_{DD}$ of the output waveform and is accurate as long as the interconnect resistance is negligible as compared to the inverter output impedance. The output slope of the buffer is

$$s_r = \frac{t_{0.9} - t_{0.1}}{0.8} = \frac{C_L V_{DD}}{I_{DO}} \left( \frac{0.9}{0.8} + \frac{V_{DO}}{0.8 V_{DD}} \ln \frac{10 V_{DO}}{e V_{DD}} \right) \tag{7}$$

For a given clock signal path, if the assigned delay of branch $b_i$ is much greater than the assigned delay of the following branch $b_{i+1}$ (see Figure 3), (6) and (7) are no longer valid. Therefore, the geometric size of the buffer within branch $b_i$ is increased to decrease the transition time of the output waveform ($I_{DO}$ increases, $s_r$ decreases). The geometric size of a buffer is determined from the following relationship between the drain current and the transistor width [13],

$$W_{new} = \frac{I_{DO(new)}}{I_{DO(measured)}} W_{measured} \tag{8}$$

Figure 3: Buffer resizing

Equations (6), (7), and (8) provide the necessary relationships to design the buffer elements along each clock signal path.

## 5. EXPERIMENTAL RESULTS

The accuracy of this clock distribution network design methodology is measured by comparing the circuit implementation with SPICE. Specifically, the individual clock path delay and clock skew values determined in the clock skew scheduling, topological, and circuit synthesis phases are compared with delay values derived from SPICE.

Table 1 compares the difference between the calculated and measured clock path delays for the clock topology shown in Figure 2. The second column depicts the desired delay obtained from the topological and circuit synthesis of the clock distribution network.
The third column shows the delay values of each clock path derived from SPICE circuit simulation using Level-3 device models, while the fourth column depicts the per cent error between the calculated and the numerically derived delay. Note that the maximum error for this example is less than 2.5%.

Table 1: Comparison between calculated and measured clock path delay

| Clock Path | Calculated delay (ns) | SPICE delay (ns) | Error |
|--------------------|------|------|------|
| $R_1, R_2, R_3$ | 7.0 | 6.84 | 2.3% |
| $R_{15}, R_4, R_9$ | 7.0 | 7.11 | 1.6% |
| $R_{18}$ | 8.0 | 8.10 | 1.3% |
| $R_{19}, R_{20}$ | 7.0 | 7.04 | 0.6% |
| $R_5, R_6, R_{16}$ | 4.0 | 4.06 | 1.5% |
| $R_{17}$ | 8.0 | 8.06 | 0.8% |
| $R_{14}$ | 7.0 | 7.17 | 2.4% |
| $R_{11}$ | 7.0 | 7.16 | 2.3% |
| $R_{12}$ | 6.0 | 6.14 | 2.3% |
| $R_{13}$ | 8.0 | 8.20 | 2.5% |
| $R_7, R_{10}$ | 7.0 | 6.90 | 1.4% |
| $R_8$ | 9.0 | 8.90 | 1.1% |

A more significant measure of the effectiveness of this clock distribution network design methodology is to guarantee that the clock skew between any pair of registers in the same global data path is accurately implemented rather than the delay of each individual clock path. Table 2 illustrates the clock skew between registers for the same circuit example. Column two shows the scheduled clock skew implemented with the design methodology described herein for the pair of registers presented in column one. Column three depicts the values obtained from SPICE circuit simulation, while column four shows the per cent error between both measurements. Note that the maximum error for this example is 4%, a number well within practical and useful limits.

Table 2: Comparison between specified and measured clock skew values

| Registers | Specified skew (ns) | Measured skew (ns) | Error |
|---------------------|------|-------|------|
| $R_1 - R_3$ | 0.0 | 0.0 | 0.0% |
| $R_{15} - R_{16}$ | 3.0 | 3.0 | 0.0% |
| $R_{12} - R_{13}$ | -2.0 | -2.06 | 3.0% |
| $R_{17} - R_{19}$ | 1.0 | 1.03 | 3.0% |
| $R_4 - R_{14}$ | 0.0 | 0.06 | |
| $R_6 - R_{12}$ | -2.0 | -2.08 | 4.0% |
| $R_5 - R_{13}$ | -4.0 | -4.14 | 3.5% |

The individual data paths have been selected to illustrate several types of clock skew situations, such as non-zero clock skew between registers in the same data path or in separate data paths. The first three examples illustrate situations where the clock skew between two registers is dependent only upon the delay of the final branches of each clock path. The remaining examples illustrate situations where the clock skew is dependent upon the delay of other branches of the network besides the final branches. Observe that in the last examples, the error tolerance of the clock skew is still within acceptable margins, exhibiting good accuracy, even for those paths in which an insignificant portion of the clock distribution network is common.

## 6. CONCLUSIONS

ASIC-based synchronous systems require the efficient synthesis of high speed clock distribution networks in order to obtain higher levels of circuit performance and improved design turnaround. In this paper, circuit performance is improved by using non-zero localized clock skew to reduce the minimum clock period and to eliminate race conditions. Distributed buffers are included during the synthesis process, permitting the clock distribution network to be optimized for the specific performance requirements of the circuit application.

An integrated methodology is presented for synthesizing clock distribution networks and is divided into four phases: 1) optimal clock scheduling, 2) topological design of the clock distribution network, 3) design and modeling of the circuit delay elements, and 4) layout implementation. The first three phases have been implemented and the experimental results show excellent agreement between the calculated clock skew and the clock skew values obtained from SPICE. This synthesis method produces the clock distribution network concurrently with the data flow portion of the circuit. As a consequence, the layout of the clock distribution network can be performed as part of the automated placement and routing of the ASIC.

Thus, an integrated methodology for synthesizing tree-structured clock distribution networks for high speed ASICs is presented. This methodology, based on inserted delay elements, accurately synthesizes localized non-zero clock skews, thereby increasing the system clock frequency and eliminating race conditions.

## REFERENCES

- [1] S. Pullela, N. Menezes, J. Omar, and L. T. Pillage, "Skew and Delay Optimization for Reliable Buffered Clock Trees," Proceedings of the IEEE International Conference on Computer-Aided Design, pp. 556-562, November 1993.
- [2] H. B. Bakoglu, J. T. Walker, and J. D. Meindl, "A Symmetric Clock-Distribution Tree and Optimized High-Speed Interconnections for Reduced Clock Skew in ULSI and WSI Circuits," Proceedings of the IEEE International Conference on Computer Design, pp. 118-122, October 1986.
- [3] T.-H. Chao, Y.-C. Hsu, J.-M. Ho, K. D. Boese, and A. B. Kahng, "Zero Skew Clock Routing with Minimum Wirelength," *IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing*, Vol. CAS-39, No. 11, pp. 799-814, November 1992.
- [4] R.-S. Tsay, "An Exact Zero-Skew Clock Routing Algorithm," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, Vol. CAD-12, No. 2, pp. 242-249, February 1993.
- [5] E. G. Friedman, "Clock Distribution Design in VLSI Circuits - an Overview," Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1475-1478, May 1993.
- [6] J. P. Fishburn, "Clock Skew Optimization," *IEEE Transactions on Computers*, Vol. C-39, No. 7, pp. 945-951, July 1990.
- [7] K. A. Sakallah, T. N. Mudge, and O. A. Olukotun, "checkTc and minTc: Timing Verification and Optimal Clocking of Synchronous Digital Circuits," Proceedings of the IEEE/ACM Design Automation Conference, pp. 111-117, June 1990.
- [8] T. G. Szymanski, "Computing Optimal Clock Schedules," Proceedings of the IEEE/ACM Design Automation Conference, pp. 399-404, June 1992.
- [9] C. H. Papadimitriou and K. Steiglitz, *Combinatorial Optimization: Algorithms and Complexity*, Prentice-Hall, 1982.
- [10] J. L. Neves and E. G. Friedman, "Topological Design of Clock Distribution Networks Based on Non-Zero Clock Skew Specifications," Proceedings of the IEEE 36th Midwest Conference on Circuits and Systems, pp. 468-471, August 1993.
- [11] S. Dhar and M. A. Franklin, "Optimum Buffer Circuits for Driving Long Uniform Lines," *IEEE Journal of Solid State Circuits*, Vol. SC-26, No. 1, pp. 32-40, January 1991.
- [12] J. L. Neves and E. G. Friedman, "Circuit Synthesis of Clock Distribution Networks Based on Non-zero Clock Skew," Proceedings of the IEEE International Symposium on Circuits and Systems, May 1994 (in press).
- [13] T. Sakurai and A. R. Newton, "Alpha-Power Law MOSFET Model and its Applications to CMOS Inverter Delay and Other Formulas," *IEEE Journal of Solid State Circuits*, Vol. SC-25, No. 2, pp. 584-594, April 1990.