Highlights of Design F

23-september-81

Design F is the fifth design I’ve made in the last couple weeks. The designs up to and including design F have sported increasing communication bandwidth. I have now achieved a bandwidth comparable to that of DMA channels, with the added flexibility of 4-cycle communication.

Design F also incorporates a feature for global communication. A controlling processor (ISBC 86/12) will provide a number of utility functions to the cube processors: a clock, a reset signal, a periodic interrupt to refresh rams, and general communication signals.

There are two general communication paths: a path from the controlling processor to the cube processors, and a shared path from each of the cube processors to the controlling processor. There are 3 data signals generated by the controlling processor that can be read by each of the cube processors. There is a 5 bit open collector bus connecting all of the cube processors to the controlling processor.

In addition, a switch is provided on each processor that can be read by the respective processor. An LED is provided on each processor that can be illuminated under processor control.

These capabilities, although somewhat irregular, allow some important functions. The periodic interrupt can be used to synchronize messages transmitted over the three input lines. The open collector output lines can be used for such purposes as deadlock detection. The LED will permit visual identification of a board if its position in the array is known (such as identification of defective boards). The switch will permit identification of a board's position in the array given that its physical position is known (i.e. the button is pressed).
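The deadlock-detection use of the open collector lines (see note 3 below) can be modelled in a few lines. This is a minimal sketch, assuming one line of the bus is set aside for the purpose and that a busy processor pulls it low; the function names and the boolean encoding are illustrative and are not part of the design.

```python
def line_level(pulling_low):
    """Level of one open collector line shared by all cube processors.

    pulling_low holds one boolean per processor, True while that processor
    drives the line low. The line reads low if any processor pulls it low;
    otherwise the pull-up lets it float high.
    """
    return 0 if any(pulling_low) else 1


def deadlock_assumed(idle_flags):
    """An idle processor releases the line (assumed encoding: busy = low).

    If every processor is idle the line floats high, and the controlling
    processor assumes deadlock.
    """
    pulling_low = [not idle for idle in idle_flags]
    return line_level(pulling_low) == 1
```

With three processors, `deadlock_assumed([True, False, True])` is false because the one busy processor still holds the line low; only when all three flags are true does the line float high.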

I may be slightly, or greatly, off in my design. I do not rule out the possibility that the optimal design is with UART chips or FIO chips, but my inspirations in other directions have not been as good. There is also room for fine tuning of this design - it would be a valuable contribution to demonstrate how to save even one chip. Constructive advice and help will be greatly appreciated.

---

1. It is the fourth input bit
2. It is the sixth output bit
3. If a processor is idle it releases the line - if all processors are idle the line floats high and deadlock is assumed.
This document describes an interprocessor communication proposal for the cube processor. This proposal is implemented in design F, 23-september-81.

**Design Goals**

**Speed**

Speed is the primary design goal. Problems can be devised where the performance of the entire machine will be proportional to the speed of the communications. There are, however, many important problems where the communications speed is of no practical importance.

**Cost**

Cost is intertwined with the other goals and must be considered alongside them. The goal of the machine is to perform calculations at minimal cost. The speed or cost of an individual processor is not overwhelmingly important; a speed increase of 50% is unacceptable if the cost increase is 80%. The proposed design therefore favors low cost, and passes over more powerful designs if their cost is much higher.

A less troublesome design goal, but an equally important one, is that the design be amenable to a high level software communications system. This goal requires that attention be given to such questions as queueing and deadlock at the design proposal phase. Were these questions ignored at this point, the extra software overhead required might annul any gain in hardware performance.

**Full Handshake Communications**

The last design goal will be satisfied by a full handshake communications system. Full handshake means that a message cannot be transmitted by a sender until the receiver is ready to accept it. Many common communications schemes are not like this: UARTs will overrun if the receiver is full, and GPIB interfaces are similar.

Communications proposed for this design will involve FIFOs. A fifo is an asynchronous device that can be loaded and unloaded with data words. The device is capable of storing a fixed number of data words, and when it is full it indicates so and refuses to load further data.

Our implementation will have two fifos per communication link. A sending processor will examine one fifo to determine if it can accept another message, and if so will load another message. The message will be stored in fifos until a receiving processor checks its fifo to determine if at least one message is available. If a message is available it is read and removed from the receiver fifo. The transmitter and receiver fifos are connected by a communications link (a cable).

If the receiver stops reading from its fifo then the fifos may fill completely. When this happens the transmitter will stop transmitting.
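The two-fifo arrangement can be sketched in software to show how back pressure reaches the sender. This is only a behavioral model under stated assumptions: the class names `Fifo` and `Link`, the `transfer` method standing in for the cable, and the default depth of 16 are illustrative choices, not a description of the hardware.

```python
from collections import deque


class Fifo:
    """Fixed-capacity queue that refuses further loads when full."""

    def __init__(self, depth=16):
        self.depth = depth
        self.words = deque()

    def full(self):
        return len(self.words) >= self.depth

    def empty(self):
        return not self.words

    def load(self, word):
        if self.full():
            return False  # full: the load is refused
        self.words.append(word)
        return True

    def unload(self):
        return self.words.popleft()


class Link:
    """Transmitter fifo and receiver fifo joined by a cable (hypothetical model)."""

    def __init__(self, depth=16):
        self.tx = Fifo(depth)
        self.rx = Fifo(depth)

    def transfer(self):
        # The cable moves words only while the receiver fifo has room.
        while not self.tx.empty() and not self.rx.full():
            self.rx.load(self.tx.unload())

    def send(self, word):
        accepted = self.tx.load(word)
        self.transfer()
        return accepted

    def receive(self):
        if self.rx.empty():
            return None  # no message available
        word = self.rx.unload()
        self.transfer()
        return word
```

If the receiver never calls `receive`, both fifos fill and `send` starts returning `False`; a single `receive` frees one slot and the sender can proceed again.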

**Data Path Size**

Two conflicting considerations govern the number of bits in the data words. Wide data paths are undesirable because they require thick bundles of wire and expensive connectors. On the other hand, narrow data paths are undesirable because a processor should not be burdened with assembling small data words.

This proposal addresses these conflicting requirements by using a small data word on the communications link and a large word at the processor interface. There is special hardware to convert between small and large words.

The communication path that traverses the cables is four bits wide. The width of the cable is much larger than this, however: the link is bidirectional, and there are two handshake lines in each direction. Including four ground conductors, a 16 conductor cable is required.
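The conductor count can be checked directly, taking "bidirectional" to mean a separate four bit data path in each direction (an assumption, but one consistent with the stated total):

$$
\underbrace{2 \times 4}_{\text{data}} + \underbrace{2 \times 2}_{\text{handshake}} + \underbrace{4}_{\text{ground}} = 16
$$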

The fifos are implemented as 16x4 fifos.

The communication to the processor is in 16 bit words. The processor will read or write an entire word in one cycle. Special hardware translates the 16 bits into 4 nibbles and cycles these individually to the fifos.

**Interrupts**

When a message is available in a fifo the processor is interrupted. The processor can then read the entire message from the fifo. Since the fifos are 4 bits wide they could conceivably contain 0-16 nibbles. If the processor did not know how many nibbles were in the fifos it would be required to check the fifo empty bit between each nibble. Such checking is unacceptably time consuming and has been eliminated.

An interrupt is generated only when a fifo is entirely full. If the fifo is full then there are 16 nibbles in the fifo and the processor can reliably read all 16 nibbles without looking.

The penalty for this speedup is that messages less than a full fifo long cannot be reliably sent. Such a message would partially fill the fifo in the receiver and would remain there until another message came along to push it through.

A full sized message is now 64 bits.
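The interrupt rule and its penalty can be illustrated with a small model. This is a sketch only: the receiver fifo is represented by a plain list, and the names `interrupt_pending`, `message_bits`, and `deliver` are hypothetical.

```python
FIFO_DEPTH = 16   # nibbles held by one 16x4 fifo
NIBBLE_BITS = 4   # bits per nibble


def interrupt_pending(nibbles_in_fifo):
    """The processor is interrupted only when the fifo is entirely full."""
    return nibbles_in_fifo == FIFO_DEPTH


def message_bits():
    """Size of a full message: 16 nibbles of 4 bits."""
    return FIFO_DEPTH * NIBBLE_BITS


def deliver(fifo, incoming):
    """Load nibbles into the receiver fifo (a list standing in for hardware).

    Returns the complete messages read by the interrupt handler. Each time
    the fifo fills, the handler reads all 16 nibbles without checking the
    empty bit between nibbles.
    """
    messages = []
    for nibble in incoming:
        fifo.append(nibble)
        if interrupt_pending(len(fifo)):
            messages.append(fifo[:])
            fifo.clear()
    return messages
```

Delivering only 10 nibbles returns no messages and leaves them sitting in the fifo; they are read only once 6 more nibbles arrive to fill it.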

**Details of the Inter-fifo Communication**

Communication over the cables occurs with 4-cycle asynchronous handshake timing. The fifo chips (almost) support the 4-cycle without additional logic. The only chips needed in addition to the fifos (besides one gate to solve the 4-cycle problem) are buffer chips to provide drive and hysteresis for the long cable.

---

1 an 8086

2 nibble is a term for a 4-bit word

**Details of the 4 to 16 Bit Conversion**

Conversion between nibbles and words is accomplished with a shift register 4 bits wide and 4 stages long. Conversion from nibbles to words is accomplished by clocking the register four times with a new nibble at the input each time. After the four clocks, 16 bits are available at the parallel outputs of the shift register. Conversion from words to nibbles is accomplished by loading the 16 bits of the shift register in parallel and then shifting the register four times. Each shift loads one nibble into a fifo.
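The shift register behavior can be sketched as follows. The nibble order is not specified in this proposal; the sketch assumes the most significant nibble travels first, and the function names are illustrative.

```python
def nibbles_to_word(nibbles):
    """Clock four nibbles into the register and return the 16 bit output.

    Assumption: the first nibble clocked in ends up most significant.
    """
    if len(nibbles) != 4:
        raise ValueError("exactly four nibbles are required")
    register = 0
    for nibble in nibbles:
        # One clock: shift the register by one nibble and take in the new one.
        register = ((register << 4) | (nibble & 0xF)) & 0xFFFF
    return register


def word_to_nibbles(word):
    """Load 16 bits in parallel, then shift four times, emitting a nibble each time."""
    register = word & 0xFFFF
    nibbles = []
    for _ in range(4):
        nibbles.append((register >> 12) & 0xF)  # nibble presented to the fifo
        register = (register << 4) & 0xFFFF
    return nibbles
```

Under this ordering `word_to_nibbles(0x1234)` gives `[1, 2, 3, 4]`, and `nibbles_to_word` reverses it exactly.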

Clocking for these conversions is generated by a state machine. The state machine is activated by either an io-read or an io-write to any of the fifo addresses. The state machine, clocked by the system clock, immediately deasserts the ready line to the processor to extend the cycle. The state machine will then operate for 5 clock cycles, shifting the register four times and reading or writing in parallel. Following the five cycles the state machine releases the CPU and waits for another fifo operation.
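The five clock sequence can be written out as a simple table generator. This is a sketch under assumptions: the ordering of the parallel transfer relative to the four shifts (load first on a write, read last on a read) is my guess at the natural arrangement, and the function name `fifo_cycle` is hypothetical.

```python
def fifo_cycle(operation):
    """List the actions on each of the five clocks of one fifo operation.

    Each entry is (clock number, action, ready). Ready is False on all five
    clocks because the state machine holds the processor until it finishes.
    Assumed ordering: a write loads the register in parallel and then shifts
    four nibbles out; a read shifts four nibbles in and then presents the
    word in parallel.
    """
    if operation == "write":
        steps = ["parallel load"] + ["shift"] * 4
    elif operation == "read":
        steps = ["shift"] * 4 + ["parallel read"]
    else:
        raise ValueError("operation must be 'read' or 'write'")
    return [(clock, action, False) for clock, action in enumerate(steps, start=1)]
```

Either operation occupies five clocks with four shifts, which matches the count given above; after the fifth clock the ready line is restored and the processor completes its cycle.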

The amount of delay of the processor is minimal. Normally, a cycle is 3 clock periods; the state machine extends the cycle to 6.$^3$

$^3$I think, maybe a few more
FIFO State Machine