# Design and Low Speed Testing of a Four-Bit RSFQ Multiplier-Accumulator

Quentin P. Herr, Nada Vukovic, Cesar A. Mancini, Kris Gaj, Qing Ke<sup>†</sup>, Victor Adler,
Eby G. Friedman, Andrzej Krasniewski<sup>††</sup>, Mark F. Bocko, and Marc J. Feldman
Department of Electrical Engineering, University of Rochester, Rochester, NY 14627

†TRW Space & Electronics Group, Redondo Beach, California

††Institute of Telecommunications, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warszawa, Poland

Abstract—We have designed and tested a four-bit RSFQ multiplier-accumulator, the central component of our decimation digital filter. The circuit consists of 38 synchronous RSFQ cells of six types arranged into a rectangular systolic array fed by one parallel input and one serial input. Timing is based on a counterflow clock distribution scheme with a simulated maximum clock frequency of 11 GHz. The circuit, fabricated at Hypres, Inc., contains 1100 Josephson junctions, has a power consumption of less than 0.2 mW, and occupies an area of less than 2.5 mm<sup>2</sup>. The multiplier-accumulator has been tested at low frequency, demonstrating full functionality and stable operation over a 24-hour testing period. This four-bit multiplier-accumulator is one of the largest reported RSFQ circuits verified experimentally to date.

#### I. INTRODUCTION

RSFQ logic has emerged as a possible alternative to advanced semiconductor technologies for very low power, ultra-high speed digital and mixed-signal applications [1]. An area where RSFQ circuits could find immediate application is digital signal processing (DSP) [2]. The ubiquitous multiplier-accumulator (MAC) is at the heart of digital filters, correlators, and other complex DSP applications. We have designed and tested at low speed a four-bit multiplier accumulator, the central component of a decimation digital filter being developed at the University of Rochester [3].

Until very recently, the design of RSFQ circuits was limited to the basic logic gates [1], [4], [5] and simple repetitive structures [6], [7]. Recent advances in the optimization of basic cells [8], [9], clocking [2], [10], and the calibration of CAD tools for RSFQ logic [11], [12] have made possible the development of large scale RSFQ circuits comprising hundreds of gates and thousands of Josephson junctions (JJs). The largest and most complex RSFQ circuits verified experimentally to date include an SFQ-counting analog-to-digital converter [13], [14], and an FFT radix 2 butterfly circuit [2], [15]. Both of these circuits consist of over 1000 JJs, and include several types of RSFQ cells. Both circuits have been verified experimentally at low speed [13], [15] and, at least partially, at high speed [2], [14].

Our multiplier-accumulator is another example of a large and complex RSFQ circuit; it is composed of 1100 Josephson junctions and comprises six different types of synchronous RSFQ gates. Here we report successful low-speed operation; high-speed verification is currently in progress [3]. Due to its regular architecture, this four-bit MAC can be easily expanded to an eight-bit version, with the number of junctions scaling proportionally with the number of bits.

Manuscript received August 25, 1996. Cadence, Diva, Verilog, and Verilog-XL are all trademarks of Cadence Design Systems, Inc.

Financial support for this research was provided in part by the University Research Initiative at the University of Rochester, sponsored by the Army Research Office under Grant No. DAAL03-92-G-0112.

In some ways, our multiplier-accumulator is more complex than the main digital components of the two previously cited large scale RSFQ circuits. The digital decimation filter in the SFQ-counting analog-to-digital converter performs a simple addition of bits from a serial input, and thus lacks any multiplication capability. The serial multiplier in the FFT radix 2 butterfly circuit performs a single multiplication of two *n*-bit words, as compared to multiple multiplications with accumulation in our circuit. Our circuit is also better suited as a fast preprocessor for semiconductor-based integrated circuits. The output of our multiplier-accumulator is read out in parallel at a rate orders of magnitude lower than the clock frequency, whereas the output is read out at the clock frequency in a serial multiplier. The use of an adder-accumulator, as opposed to a carry-save adder [15], accounts for this difference.

Our circuit also has important features in common with the two digital circuits referred to above. All three circuits apply a regular topology known as a systolic array [16]. A systolic array consists of a limited number of cells performing single primitive operations connected into a regular structure in which data are exchanged only between physically adjacent cells. The input data are introduced to the systolic array once and reused many times. Additionally, the data and the partial results flow through the structure in at least two different directions. Another common feature is the use of global clocking rather than asynchronous communication between the cells. As a result of the topology of a systolic array, the clock distribution network is very regular, with the clock and the data flowing in either the same or opposite directions. Clock skew (i.e., the difference between the arrival time of the clock signal at two cells in the circuit) is non-zero even between two physically adjacent cells. The clock skew between opposite ends of the clock distribution network can become large compared to the clock period of the circuit; therefore, several clock pulses may travel through the network simultaneously. We believe these two design trends, a systolic array topology and global clocking, will continue to dominate the design of large scale digital RSFQ circuits in the near future.

#### II. FUNCTION, STRUCTURE, AND OPERATION

The function of a multiplier accumulator is described by the equation





Fig. 3. Adder-accumulator. a) Mealy state-transition diagram. The lighter arrow shows a transition that does not appear during the normal operation of the cell. b) internal structure. Notation: CONF - confluence buffer, T1FF - T1 flip-flop, DRO - destructive read-out cell.

soid in the multi-parameter operating region of the cell. The lengths of the ellipsoid axes in the global parameter space are shown in Table I. Note that the lengths of the axes are similar for all gates, and are larger than the 3$\sigma$ standard deviation of the respective global parameters. All basic gates have been tested independently and demonstrate robust low speed operation. Experimental and simulated margins for the global bias current *Ib* are shown in Table I. The experimental margins are large, and the total experimental operating range of the global bias current, defined as the sum of the absolute values of the margins, shows only a small reduction compared to the simulated range.
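The definition of the total operating range given above can be made concrete with a short Python sketch. The margin values are transcribed from the *Ib* columns of Table I; the names `MARGINS` and `operating_range` are hypothetical and introduced only for this illustration.

```python
# Global bias current margins in percent, transcribed from Table I.
# cell -> ((simulated_low, simulated_high), (experimental_low, experimental_high))
MARGINS = {
    "PSR1": ((-19, 39), (-30, 11)),
    "PSR2": ((-39, 34), (-27, 24)),
    "AND":  ((-55, 48), (-63, 31)),
    "AAC":  ((-22, 51), (-33, 25)),
    "DRO":  ((-41, 56), (-46, 67)),
}

def operating_range(margins):
    """Total operating range: sum of the absolute values of the two margins."""
    low, high = margins
    return abs(low) + abs(high)

for cell, (simulated, experimental) in MARGINS.items():
    print(f"{cell}: simulated {operating_range(simulated)}%, "
          f"experimental {operating_range(experimental)}%")
```

The loop prints one line per cell, so the simulated and experimental ranges can be compared side by side.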

#### IV. DESIGN OF THE CLOCKING CIRCUITRY

The multiplier-accumulator is a fully synchronous circuit. It is composed of clocked RSFQ gates connected by Josephson transmission lines (JTLs) and splitters. The circuit was originally designed with two different clocking schemes [10]. In the counterflow clocking scheme (Fig. 1), the clock signal is distributed in the direction opposite to the direction of the data flow (bottom-up and from the left to the right). In concurrent flow clocking, the clock and the data flow in the same direction. Two other global signals, *nclr* and the serial input $c(i)_k$, are distributed in both clocking schemes in the same direction as the clock signal.

Counterflow clocking was chosen to be implemented first, despite the lower maximum simulated frequency (11 GHz vs. 29 GHz), because (a) it requires a smaller number of JTL stages in the clock distribution network and the data paths, (b) it is easier to lay out, and (c) its design is much less dependent on fabrication-induced parameter deviations, which are not yet well characterized.

TABLE I
MARGINS OF BASIC GATES COMPRISING THE MULTIPLIER ACCUMULATOR

| Cell | XL<sup>a</sup> | XIcb<sup>a</sup> | Simulated Ib | Experimental Ib |
|------|-----|------|--------------|-----------------|
| PSR1 | 38% | 53%  | -19% +39%    | -30% +11%       |
| PSR2 | 38% | 53%  | -39% +34%    | -27% +24%       |
| AND  | 40% | 55%  | -55% +48%    | -63% +31%       |
| AAC  | 38% | 52%  | -22% +51%    | -33% +25%       |
| DRO  | 41% | 56%  | -41% +56%    | -46% +67%       |

<sup>a</sup> Percentages are lower bounds on the margins corresponding to the "axis lengths" figure of merit returned by MALT. XIcb and XL denote the global critical current with the bias current adjusted proportionally and the global inductance, respectively.

With counterflow clocking, most interconnections in the clock distribution network and the data paths comprise the minimum number of JTL stages necessary to satisfy the minimum physical distance between the input/output pads of the cells in the layout. The most critical timing constraints, which limit the maximum clock frequency of the entire circuit, are imposed by the interconnections of the adder-accumulator cell, AAC, with the AND gate and the preceding AAC. The correct minimum intervals must be preserved between the data pulses, a and b, and the control pulses, clk and rd. Additionally, because of the confluence buffer in the AAC (see Fig. 3b), a minimum interval must be preserved between the data pulses a and b. All these constraints must be met in the presence of the timing parameter variations resulting from fabrication inaccuracies [17]. Taking these variations into account, we estimate the maximum clock frequency to be 9 GHz, about 20% below the maximum simulated clock frequency.
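To put the two frequencies quoted above on a common footing, the following Python sketch converts them to clock periods and computes the relative reduction. It uses only the 11 GHz and 9 GHz figures from the text; the function names are hypothetical, and the sketch does not model the timing constraints themselves.

```python
def clock_period_ps(freq_ghz):
    """Clock period in picoseconds for a frequency given in GHz."""
    return 1000.0 / freq_ghz

def derating(simulated_ghz, worst_case_ghz):
    """Fractional reduction of the worst-case frequency relative to the simulated one."""
    return 1.0 - worst_case_ghz / simulated_ghz

print(f"simulated period:  {clock_period_ps(11):.1f} ps")
print(f"worst-case period: {clock_period_ps(9):.1f} ps")
print(f"derating:          {derating(11, 9):.1%}")
```

The computed reduction is 2/11, or roughly 18%, which is consistent with the "about 20%" stated above.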

The clocking circuitry was designed using logic (gate level) simulation. The simulation was first performed using the custom designed University of Rochester Superconducting Logic Analyzer, *URSULA*, as described in [18], and then repeated by functionally modeling the RSFQ gates in Verilog hardware description language (HDL) [11], followed by simulating the entire circuit in the Cadence Verilog-XL environment [12], [19].

### V. FABRICATION AND TESTING

The entire multiplier accumulator was laid out using the Cadence Virtuoso graphical environment [19] according to the Hypres, Inc., standard Nb process foundry design rules [20]. The layout of the circuit is shown in Fig. 4. It was fabricated at Hypres with the target junction critical current density of 1 kA/cm². The circuit was tested at low speed using our automated twenty-four channel data acquisition setup, controlled by a PC running Labview.

Our first version MAC was *not* developed according to the design procedure described in section IV. Rather, all interconnections in the circuit were laid out using the minimum number of JTL stages necessary to cover the distance between the cells. After the circuit had been sent for fabrication, the timing analysis performed using URSULA revealed a likely source of circuit failure. The interval between the nominal positions of the data pulses at the inputs a and b of the AAC was too small, only slightly greater than the minimum separation time of the confluence buffer in the AAC.

In fact, the circuit tests confirmed this analysis. The circuit worked correctly when the input sequence was chosen so that the critical data pulses did not appear at the inputs a and b of any AAC in the circuit in the same clock cycle. For input sequences that did require data pulses at both a and b in the same clock period, the circuit was unstable. These results confirm the necessity of carefully analyzing the timing of the



Fig. 1. Logic level schematic of the multiplier-accumulator. Notation: PSR1, PSR2 - parallel shift registers, AND - and gate, AAC - adder-accumulator, AAC1 - one-input adder-accumulator, D - destructive read-out cell.

$$y = \sum_{i=0}^{m-1} c(i) \cdot x(i), \tag{1}$$

where m denotes the number of the accumulated products, x(i) and c(i) are the i-th four-bit multiplier and the i-th four-bit multiplicand, and y is the accumulated sum. Equation (1) can be rewritten as follows

$$y = \sum_{i=0}^{m-1} \left( \sum_{k=0}^{3} c(i)_{k} \cdot 2^{k} \right) \cdot x(i) = \sum_{i=0}^{m-1} \sum_{k=0}^{3} c(i)_{k} \cdot \left( 2^{k} \cdot x(i) \right), \tag{2}$$

where  $c(i)_k$  denotes the k-th bit of c(i).

Our RSFQ multiplier-accumulator (MAC) consists of 38 RSFQ subcells of six types arranged into a rectangular systolic array fed by one parallel input, $x(i) = x(i)_3...x(i)_0$, and one serial input, $c(i)_k$, as shown in Fig. 1. The multiplier x(i) is supplied at the parallel input every four clock cycles, four bits at a time. The multiplicand c(i) is delivered at the serial input every clock cycle, one bit at a time, least significant bit first. The first row of cells consists of two types of parallel shift registers (PSR1, PSR2). Its function is to generate in each clock cycle the appropriate multiple of the parallel input: x(i), $2 \cdot x(i)$, $4 \cdot x(i)$, and $8 \cdot x(i)$. This is obtained by shifting the contents of the registers one bit left in every clock cycle. After four clock cycles, a new parallel input x(i+1) is loaded into the four registers PSR1, and the registers PSR2 are cleared. The second row of cells, a series of AND gates, multiplies the appropriate multiple of the parallel input, $2^k \cdot x(i)$, by the respective bit of the serial input, $c(i)_k$. Four products of the form $c(i)_k \cdot (2^k \cdot x(i))$ are then added to the previous temporary result stored in the row of adder-accumulators. Thus, every four clock cycles a new product

$$c(i) \cdot x(i) = \sum_{k=0}^{3} c(i)_k \left(2^k \cdot x(i)\right)$$

is computed and accumulated. The maximum number of accumulated products m depends upon the number, $\ell$, of one-input adder-accumulators (AAC1). This dependence is given by

$$\lceil \log_2 m \rceil + 1 = \ell. \tag{3}$$

In our design (shown in Fig. 1),  $\ell=5$ , and so the maximum number of the accumulated products m is 16.
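As an illustration of Eqs. (1)-(3), the following Python sketch models the arithmetic only; it does not represent RSFQ pulse timing or cell-level behavior. The function names `mac_bit_serial` and `max_products` are hypothetical and chosen for this example.

```python
import math

def mac_bit_serial(xs, cs, bits=4):
    """Accumulate sum(c(i) * x(i)) following the decomposition in Eq. (2):
    one bit of c(i) per clock cycle, least significant bit first."""
    acc = 0
    for x, c in zip(xs, cs):
        shifted = x                  # shift-register row: x(i), 2x(i), 4x(i), 8x(i)
        for k in range(bits):
            c_k = (c >> k) & 1       # k-th bit of the serial input
            if c_k:                  # AND row: gate the multiple by the serial bit
                acc += shifted       # adder-accumulator row: add to running sum
            shifted <<= 1            # shift one bit left for the next clock cycle
    return acc

def max_products(ell):
    """Largest m satisfying Eq. (3): ceil(log2 m) + 1 = ell."""
    return 2 ** (ell - 1)

xs = [15, 3, 7]
cs = [15, 5, 2]
assert mac_bit_serial(xs, cs) == sum(c * x for c, x in zip(cs, xs))

m = max_products(5)
assert math.ceil(math.log2(m)) + 1 == 5

# Worst case for four-bit operands: m products of 15 * 15.
worst_case = m * 15 * 15
print(m, worst_case, worst_case.bit_length())
```

With $\ell = 5$ the sketch gives m = 16, and the worst-case accumulated value for four-bit operands fits in twelve bits.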

The twelve-bit parallel output y is read at 1/m of the parallel input frequency, that is, every $4 \cdot m$ clock cycles. Reading the output is controlled by the *read* signal fed into the right-most adder-accumulator, which stores the least significant bit of the output. Each subsequent bit of the output is read one clock cycle later than the previous one. This process is accomplished using a row of DRO cells, each of which delays the *read* signal by one clock cycle. Immediately after each output bit is read out, the corresponding adder-accumulator starts accumulating a new sum. Therefore, no additional clock cycles are required between two consecutive operations described by (1).
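The read-out schedule described above can be sketched in a few lines of Python. The functions `output_rate_hz` and `read_cycles` are hypothetical names; the sketch assumes an ideal one-cycle delay per DRO stage and uses the 11 GHz simulated clock frequency and m = 16 quoted elsewhere in the paper.

```python
def output_rate_hz(clock_hz, m, bits=4):
    """Output rate when the result is read every bits * m clock cycles."""
    return clock_hz / (bits * m)

def read_cycles(start_cycle, n_outputs=12):
    """Clock cycle at which each output bit is read, least significant bit first,
    assuming each DRO stage delays the read signal by exactly one cycle."""
    return [start_cycle + j for j in range(n_outputs)]

rate = output_rate_hz(11e9, 16)
print(f"output rate: {rate / 1e6:.1f} MHz")
print("read cycles:", read_cycles(64))
```

For an 11 GHz clock and m = 16 this gives an output rate of about 172 MHz, in agreement with the value listed in Table II.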

#### III. DESIGN AND TESTING OF BASIC CELLS

Our MAC consists of six different types of synchronous RSFQ gates: parallel shift registers, type 1 and 2 (PSR1, PSR2), an and gate (AND), a two-input adder-accumulator (AAC), a one-input adder-accumulator (AAC1), and a destructive read-out cell (DRO). In Fig. 2 we show the internal structure of both parallel shift registers. Parallel shift registers check for the logic state at the serial data input sin, delay this data by one clock cycle, and split it into two outputs - the serial output sout, and the parallel output pout. The difference between the two cells comes from their operation in the initialization phase. PSR1 loads the data from its parallel input - pin, whereas PSR2 is cleared by the zero state at the nclr input.

The adder-accumulator, shown in Fig. 3, is the most complex gate used in the MAC. Its function is to add the logic state at the inputs a and b to the internal state of the cell, IS. The least significant bit of the sum a+b+IS is accumulated as the next internal state of the cell; the more significant bit appears at the carry output and propagates to the next cell. The internal state of the adder-accumulator is read at the sum output, controlled by the control input rd. Our adder-accumulator design is based on the T1 flip-flop reported in [5], with the DRO cell synchronizing the carry output with the clk input.
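A minimal behavioral sketch of a single adder-accumulator cell, written in Python, may help make the description concrete. The class and method names are hypothetical. The sketch treats a and b as logic levels sampled once per clock cycle and assumes that reading clears the internal state; the pulse-level behavior and the exact state transitions are defined by Fig. 3, not by this model.

```python
class AdderAccumulator:
    """Behavioral model of one AAC cell: a one-bit accumulator with carry out."""

    def __init__(self):
        self.state = 0  # internal state IS

    def clock(self, a=0, b=0):
        """Add inputs a and b to the internal state and return the carry bit."""
        total = a + b + self.state
        self.state = total & 1   # least significant bit stays in the cell
        return total >> 1        # more significant bit goes to the carry output

    def read(self):
        """Return the stored bit and clear it (assumed destructive read)."""
        bit, self.state = self.state, 0
        return bit

cell = AdderAccumulator()
carries = [cell.clock(1, 1), cell.clock(1, 0), cell.clock(1, 1)]
print(carries, cell.read(), cell.read())
```

In this example the three clock cycles produce carries 1, 0, and 1, the first read returns the stored bit 1, and the second read returns 0 because the state has been cleared.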

The device parameters of all basic cells have been generated by the optimization program MALT [8]. (The complete schematics of all cells, the values of the operating parameters, and their individual margins are all available from our anonymous ftp site ftp://ftp.ee.rochester.edu/pub/uri/MAC). MALT operates by inscribing the largest possible ellip—



Fig. 2. Parallel shift registers PSR1 and PSR2. a) internal structure of the PSR1. The parallel input pin and the serial input sin are assumed to be never both active (i.e., in a logic one state) in the same clock cycle. b) internal structure of the PSR2. Notation: S - splitter, DRO - destructive read-out cell, CONF - confluence buffer, AND - and gate.



Fig. 4. The layout of the multiplier-accumulator.

RSFQ circuits using logic level simulation, even when using a conservative counterflow clocking design [10]. Even before the first version was tested, a second version was designed and sent for fabrication. The timing problem was fixed by adding JTL stages to the clock and data paths between AND and AAC. The new version worked correctly for all input sequences. The global bias current margin was $\pm 5\%$. This multiplier-accumulator demonstrated correct and stable operation over a twenty-four hour testing period.

### VI. SUMMARY

We have designed and tested at low speed a four-bit multiplier-accumulator with the features presented in Table II. The circuit can work as a stand-alone unit, as part of a decimation digital filter, or as a component in other DSP circuits. The output rate of the circuit is much lower than the input rate and the clock rate, which makes it a good preprocessor for conventional semiconductor-based electronics. The entire circuit is composed of about 1100 JJs, has a power consumption below 0.2 mW, and is biased using only two separate bias currents with a third bias line reserved for the input/output circuitry. The circuit area is small enough that an eight-bit version could be laid out using a standard 5 mm x 5 mm Hypres, Inc., integrated circuit die area.

TABLE II
PRIMARY FEATURES OF A FOUR-BIT MULTIPLIER ACCUMULATOR

| Parameter                                                                               | Value                     |
|-----------------------------------------------------------------------------------------|---------------------------|
| Josephson junction number                                                               | 1097                      |
| &emsp;within cells                                                                      | 773                       |
| &emsp;interconnections & clock distribution                                             | 324                       |
| RSFQ gates number                                                                       | 38                        |
| Power required                                                                          | 181 µW                    |
| &emsp;within cells                                                                      | 116 µW                    |
| &emsp;interconnections & clock distribution                                             | 65 µW                     |
| Area                                                                                    | 2.6 x 0.8 mm<sup>2</sup>  |
| Input/Output lines                                                                      | 23                        |
| &emsp;data inputs                                                                       | 5                         |
| &emsp;control and clock inputs                                                          | 3                         |
| &emsp;bias lines                                                                        | 3                         |
| &emsp;outputs                                                                           | 12                        |
| Maximum simulated clock frequency                                                       | 11 GHz                    |
| Maximum simulated output rate                                                           | 172 MHz                   |
| Worst case maximum clock frequency in the presence of parameter variations (calculated) | 9 GHz                     |
| Measured global bias current margins                                                    | ±5%                       |

Our multiplier-accumulator is one of the largest and most complex digital RSFQ circuits verified experimentally to date. Its development was made possible by the efficient optimization of the constituent RSFQ cells, careful timing analysis, and the use of a Cadence-based CAD environment. The worst case maximum frequency of the circuit in the presence of parameter variations is calculated to be 9 GHz, only slightly below the maximum simulated clock frequency of 11 GHz. High speed testing of the multiplier-accumulator is currently in progress [3].

## REFERENCES

- [1] K. K. Likharev and V. K. Semenov, "RSFQ logic/memory family: A new Josephson-junction technology for sub-terahertz-clock frequency digital systems," *IEEE Trans. Appl. Supercond.*, vol. 1, pp. 3-28, March 1991.
- [2] O. A. Mukhanov, P. D. Bradley, S. B. Kaplan, S. V. Rylov, and A. F. Kirichenko, "Design and operation of RSFQ circuits for digital signal processing," *Proc. 5th Int. Supercond. Electron. Conf.*, pp. 27-30, Nagoya, Japan, Sept. 1995.
- [3] Q. P. Herr, et al., "High speed testing of a four-bit RSFQ decimation digital filter," this conference.
- [4] S. V. Polonsky, et al., "New RSFQ circuits," *IEEE Trans. Appl. Supercond.*, vol. 3, pp. 2566-2577, March 1993.
- [5] S. V. Polonsky, V. K. Semenov, and A. F. Kirichenko, "Single flux, quantum B flip-flop and its possible applications," *IEEE Trans. Appl. Supercond.*, vol. 4, pp. 9-18, March 1994.
- [6] O. A. Mukhanov, "Rapid Single Flux Quantum (RSFQ) shift register family," *IEEE Trans. Appl. Supercond.*, vol. 3, pp. 2578-2581, March
- [7] O. A. Mukhanov, "RSFQ 1024-bit shift register for acquisition memo-
- [8] Q. P. Herr and M. J. Feldman, "Multiparameter optimization of RSFQ circuits using the method of inscribed hyperspheres," *IEEE Trans. Appl. Supercond.*, vol. 5, pp. 3337-3340, June 1995.
- [9] S. V. Polonsky, et al., "PSCAN 96: New software for simulation and optimization of complex RSFQ circuits," this conference.
- [10] K. Gaj, E. G. Friedman, M. J. Feldman, and A. Krasniewski, "A clock distribution scheme for large RSFQ circuits," *IEEE Trans. Appl. Supercond.*, vol. 5, pp. 3320-3324, June 1995.
- [11] K. Gaj, E. G. Friedman, and M. J. Feldman, "Functional modeling of RSFQ circuits using Verilog HDL," this conference.
- [12] V. Adler, C.-H. Cheah, K. Gaj, D. K. Brock, and E. G. Friedman, "A Cadence-based design environment for single flux quantum circuits," this conference.
- [13] J. C. Lin, V. K. Semenov, and K. K. Likharev, "Design of SFQ-counting analog-to-digital converter," *IEEE Trans. Appl. Supercond.*, vol. 5, pp. 2252-2259, June 1995.
- [14] V. K. Semenov, Yu. Polyakov, and D. Schneider, "Preliminary results on the analog-to-digital converter based on RSFQ logic," *CPEM'96 Conf. Digest*, Braunschweig, Germany, June 1996.
- [15] O. A. Mukhanov and A. F. Kirichenko, "Implementation of a FFT radix 2 butterfly using serial RSFQ multiplier-adders," *IEEE Trans. Appl. Supercond.*, vol. 5, pp. 2461-2464, June 1995.
- [16] G. M. Megson, *An introduction to systolic algorithm design*, Oxford Science Publications, Oxford, 1992.
- [17] K. Gaj, Q. P. Herr, and M. J. Feldman, "Parameter variations and synchronization of RSFQ circuits," *Applied Superconductivity 1995*, Institute of Physics Conf. Series #148, Bristol, UK, 1995, pp. 1733-1736.
- [18] A. Krasniewski, "Logic simulation of RSFQ circuits," *IEEE Trans. Appl. Supercond.*, vol. 3, pp. 33-38, March 1993.
- [19] Cadence Corporation, San Jose, CA, *Cadence Openbook*, 1993.
- [20] "Hypres Niobium process flow and design rules," available from Hypres, Inc., 175 Clearbrook Road, Elmsford, NY 10523.