This research investigates the effects of transient glitches on inter-chip communication links, aiming to improve robustness and reduce the risk of deadlock. Focusing on the SpiNNaker architecture, we examine the shortcomings of the basic transmitter and receiver designs and propose enhancements such as phase-insensitive converters and independent reset mechanisms. Simulations of 1 million packets show significantly fewer deadlocks and higher throughput for a trivial area overhead, although packet loss is worse. Future work includes a generalized methodology for circuit evaluation and an investigation of the impact of back-pressure on glitch resistance.
Fault-Tolerant Delay-Insensitive Inter-Chip Communication
Yebin Shi
APT Group, The University of Manchester
Outline
• SpiNNaker inter-chip interconnect
• Basic Transmitter and Receiver
• Potential Problems with the Designs
• Robust Transmitter and Receiver
• Future work and conclusion
Research Aims
• Investigate the impact of transient glitches on the inter-chip wires on the link interface circuits.
• Redesign the link interface circuits to increase glitch resistance and avoid deadlock.
SpiNNaker
• Network infrastructure:
  • 6 bidirectional inter-chip links
  • Delay-insensitive on-chip and inter-chip communication
  • Packets are variable-length, serialized in 4-bit flits, with an end-of-packet (EoP) marker
  • 1 Gb/s throughput per link
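The serialization described on this slide can be sketched as follows. This is a minimal illustration rather than SpiNNaker's actual implementation: the nibble order (low nibble first), the `EOP` sentinel, and the function names `packet_to_flits` and `flits_to_packet` are assumptions made for the example.

```python
EOP = "EOP"  # hypothetical stand-in for the end-of-packet marker

def packet_to_flits(payload: bytes) -> list:
    """Split a packet into 4-bit flits and append an end-of-packet marker.

    Assumption: each byte is sent low nibble first; the slides do not
    specify the order.
    """
    flits = []
    for byte in payload:
        flits.append(byte & 0xF)         # low nibble
        flits.append((byte >> 4) & 0xF)  # high nibble
    flits.append(EOP)
    return flits

def flits_to_packet(flits: list) -> bytes:
    """Rebuild the payload, using the marker to find the packet boundary."""
    end = flits.index(EOP)
    nibbles = flits[:end]
    return bytes(lo | (hi << 4) for lo, hi in zip(nibbles[0::2], nibbles[1::2]))

flits = packet_to_flits(b"\xA5\x3C")
print(flits)  # [5, 10, 12, 3, 'EOP']
```

Because the marker closes every packet, the receiver needs no length field, which is what allows packets to be variable-length.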
Inter-Chip Communication
• On-Chip Network:
  • 3-of-6 data encoding
  • 4-phase (RTZ) handshake
  • Separate data and control channels
• Inter-Chip Network:
  • 2-of-7 data encoding
  • 2-phase (NRZ) handshake
  • Data and control in a single stream
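To make the two encodings concrete, the sketch below enumerates m-of-n codewords (exactly m of n wires active) and contrasts return-to-zero with non-return-to-zero signalling. The helper names are illustrative, and no particular mapping from flit values to codewords is implied, since the slides do not give one.

```python
from itertools import combinations

def m_of_n_codewords(m, n):
    """All n-wire codewords in which exactly m wires are active."""
    return [tuple(1 if i in active else 0 for i in range(n))
            for active in combinations(range(n), m)]

def rtz_send(codeword):
    """4-phase (RTZ): wires take the codeword, then all return to zero."""
    return [tuple(codeword), tuple(0 for _ in codeword)]

def nrz_send(levels, codeword):
    """2-phase (NRZ): the selected wires toggle and stay at the new level."""
    return tuple(l ^ c for l, c in zip(levels, codeword))

three_of_six = m_of_n_codewords(3, 6)
two_of_seven = m_of_n_codewords(2, 7)
print(len(three_of_six), len(two_of_seven))  # 20 21

levels = (0,) * 7
levels = nrz_send(levels, two_of_seven[0])  # wires 0 and 1 toggle
levels = nrz_send(levels, two_of_seven[0])  # same symbol again toggles them back
print(levels)  # (0, 0, 0, 0, 0, 0, 0)
```

Both codes offer at least 16 codewords, enough for the 16 values of a 4-bit flit, and 2-of-7 leaves spare codewords so that an end-of-packet symbol can travel in the same stream. The NRZ example also shows why a receiver must track transitions rather than levels: the same symbol sent twice leaves the wires where they started.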
Link Transmitter
• Data channel: pipeline for code and phase conversion
• Ctrl channel: merges the EoP symbol into the data stream
Link Receiver
• Data channel: pipeline for phase and code conversion
• Ctrl channel: extracts EoP symbols from the stream
Glitch Impact on Simulation
• Automatic packet data generation
  • CRC scheme included for result verification
• Random generation of transient glitches injected onto the inter-chip link
  • Single Event Upset (SEU) scenarios
• Configurable frequency and duration of glitches
  • Frequency: up to ½ glitch/packet
  • Duration scale: 0.1-2 ns
• Extensive simulation
  • A large number of densely packed glitches over 1M packets
  • Speeds up fault simulation
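A glitch-injection harness along these lines might look like the sketch below. It is not the authors' testbench: `zlib.crc32` stands in for the unnamed CRC scheme, and the wire count (`n_wires=8`), the packet duration (`packet_ns=40.0`), the payload size, and all function names are assumptions chosen for illustration. Only the glitch rate (up to ½ per packet) and the 0.1-2 ns duration range come from the slide.

```python
import random
import zlib
from dataclasses import dataclass

@dataclass
class Glitch:
    packet: int      # index of the packet being sent when the glitch fires
    wire: int        # index of the disturbed link wire
    start_ns: float  # offset from the start of that packet
    width_ns: float  # duration of the glitch pulse

def make_packet(rng, n_bytes=4):
    """Random payload plus a CRC; zlib.crc32 stands in for the unnamed scheme."""
    payload = bytes(rng.randrange(256) for _ in range(n_bytes))
    return payload, zlib.crc32(payload)

def packet_ok(payload, crc):
    """Receiver-side check: recompute the CRC and compare."""
    return zlib.crc32(payload) == crc

def glitch_schedule(rng, n_packets, rate=0.5, min_ns=0.1, max_ns=2.0,
                    n_wires=8, packet_ns=40.0):
    """Draw at most one glitch per packet, with probability `rate`."""
    schedule = []
    for p in range(n_packets):
        if rng.random() < rate:
            schedule.append(Glitch(packet=p,
                                   wire=rng.randrange(n_wires),
                                   start_ns=rng.uniform(0.0, packet_ns),
                                   width_ns=rng.uniform(min_ns, max_ns)))
    return schedule

rng = random.Random(1)  # fixed seed so a failing run can be replayed
schedule = glitch_schedule(rng, n_packets=1000)
payload, crc = make_packet(rng)
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]  # flip one bit
print(packet_ok(payload, crc), packet_ok(corrupted, crc))  # True False
```

Seeding the generator keeps a run reproducible, which matters when a rare deadlock has to be replayed and inspected.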
Fault Effects in the Transmitter
• Deadlock risks:
  • A transient glitch may corrupt a 2-of-7 symbol, leading to handshake failure.
  • Phase-sensitive phase converter.
  • Independent resetting.
Fault Effects in the Receiver
• Deadlock risks:
  • A corrupted 2-of-7 symbol may prevent completion of the conversion to 3-of-6.
  • Independent resetting.
Deadlock in Receiver
• A glitch occurs while dout_cd is in transition.
• A wrong value is stored in the bottom latch.
• The next data conversion fails.
Robust 2-Phase to 4-Phase Conversion
(Figure note: reset signal not shown.)
• Phase-insensitive converter:
  • Used on the 2-phase ack input to the Transmitter.
  • Used on the 2-phase data inputs to the Receiver.
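As a behavioural illustration only, the sketch below models 2-phase to 4-phase conversion at the level of sampled wire values. It takes one reading of "phase-insensitive", namely that the converter responds to any transition regardless of the wire's absolute level; the slides do not show the circuit's logic, so this is an assumption, and the function names are hypothetical. The 4-phase handshake with the downstream stage is not modelled.

```python
def two_phase_events(levels):
    """Return the sample indices at which the 2-phase wire changes level."""
    return [i for i in range(1, len(levels)) if levels[i] != levels[i - 1]]

def to_four_phase(levels):
    """Emit one return-to-zero pulse (1 then 0) per 2-phase transition."""
    pulses = []
    for _ in two_phase_events(levels):
        pulses.extend([1, 0])
    return pulses

low_start = [0, 1, 1, 0, 0]
high_start = [1, 0, 0, 1, 1]  # same events, opposite phase
print(to_four_phase(low_start))   # [1, 0, 1, 0]
print(to_four_phase(high_start))  # [1, 0, 1, 0]
```

In this model both inputs produce the same two pulses, so the output does not depend on which level the wire happened to hold when the converter started observing it.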
Robust Receiver Design
• Phase-insensitive phase converter
• Enhanced code converter and completion detector
• Independent reset capability
Receiver Phase Converter
• acki also triggers the ack signal back to the transmitter.
Code Conversion with Priority Arbitration
• Support the full set of 2-of-7 codes
• Convert invalid symbols into a valid one
• Stop propagation of invalid symbols containing more than 2 transitions
Independent Reset
• An extra, possibly redundant, transition is created after reset in case the Tx is waiting for an acknowledge token.
• The phase-insensitive converter for ack2 in the Tx absorbs the extra token if it is not needed.
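The reasoning behind the redundant token can be illustrated with a token-level model. This is an abstraction under stated assumptions, not the circuit itself: the class `TxAckPort` and its methods are hypothetical, and a real transmitter would stall rather than raise an exception when an acknowledge never arrives.

```python
class TxAckPort:
    """Token-level model of the transmitter's ack input (hypothetical)."""

    def __init__(self):
        self.waiting = False  # True between sending a symbol and its ack
        self.absorbed = 0     # count of redundant ack tokens discarded

    def send_symbol(self):
        if self.waiting:
            # A real transmitter would simply stall here: this is the deadlock.
            raise RuntimeError("previous symbol was never acknowledged")
        self.waiting = True

    def ack(self):
        if self.waiting:
            self.waiting = False  # token needed: releases the transmitter
        else:
            self.absorbed += 1    # token redundant: absorbed, no effect

# Case 1: the receiver resets while the transmitter is waiting for an ack.
tx = TxAckPort()
tx.send_symbol()    # the ack for this symbol is lost in the receiver reset
tx.ack()            # extra transition generated by the receiver after reset
tx.send_symbol()    # the transmitter can continue

# Case 2: the receiver resets while the link is idle.
idle = TxAckPort()
idle.ack()          # the extra transition is redundant
idle.send_symbol()  # the next handshake proceeds normally
print(tx.waiting, idle.absorbed)  # True 1
```

Without the extra transition, the second `send_symbol()` in case 1 could never happen; with it, the only cost is a token that is sometimes discarded, as in case 2.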
Simulation Results
• Significantly reduced deadlock occurrence.
• Worse packet loss.
• Trivial area overhead.
• Increased throughput.
(Table caption: simulation results for 1 million packets sent.)
Conclusions and Future Work
• Enhance the resistance to transient glitches in inter-chip links by replacing phase converters.
• Avoid deadlocks by hardening completion detection modules in the receiver.
• Remove corrupt symbols by applying an arbitration scheme for symbol conversions.
• Allow independent chip resets without introducing deadlocks by sending safe, possibly redundant tokens (data or ack) on reset.
• A generalized approach for circuit evaluation, including the computation of safety margins.
• Investigation into the impact of back-pressure on glitch resistance.