340 likes | 519 Vues
a HL-LHC CMS pixel detector ASIC in 65nm technology. Jorgen Christiansen, CERN/PH-ESE Elia Conti, Perugia. Outline. Driving requirements 65nm for HL-LHC pixel Optimization of pixel regions Global architecture Alternatives Flexibility Planning Summary. Driving pixel ASIC requirements.
E N D
a HL-LHC CMS pixel detector ASIC in 65nm technology Jorgen Christiansen, CERN/PH-ESE Elia Conti, Perugia
Outline • Driving requirements • 65nm for HL-LHC pixel • Optimization of pixel regions • Global architecture • Alternatives • Flexibility • Planning • Summary
Driving pixel ASIC requirements • Pixel size • Driven by physics and what can be made by technology (Readout chip). • Sensor type: 2D Si, 3D Si, Diamond • Radiation damage • It will probably take quite some time to have a decision on this -> keep flexibility • Connection to sensor. • Capacitance, Bump bonding, Thinning, Polarity, Collected charge, Cluster size, leakage, etc. • Analog performance • Threshold and variation, Noise, ADC/TOT/Bin, etc. • Data (hit) rate, data buffering and triggering • Pixel trigger with ROI ?. • Multiple modes: Triggered, Trigger less, Self triggered, Test, Calibration, etc. • This is what technology can “buy” us with “intelligent pixels”.(if we do not want too small pixels) • Readout • Control • Low power –> Material budget • On-chip/on-module power conversion • System (hybrid) integration • Radiation tolerance (TID, Neutrons, Hadrons, SEU) Hybrid pixel with deep sub-micron (65nm) technology for readout chip.
What we can get from 65nm NMOS Vt shift • Radiation tolerance (TID, Displacement, SEU): • TID has now been “demonstrated” • No need for ELT (Enclosed Layout Transistors) • SEU to be handled with redundant logic where needed • We need: ~10MGy, 1016neu/cm2 (EXTREME !) • Can we use small standard cells ? • To be confirmed • Large amount of digital logic/memory: • Vital for small pixels, high date rates, buffering, flexibility. • Logic density: 250nm: 1, 130nm: ~4x, 65nm: ~16x • For radiation environments 250nm was in fact 2-4x worse as ELT required, so logic density improvement from 250nm to 65nm is 32-64x • Speed: 250nm: ~1, 130nm: ~2x, 65nm: ~4x(Where high speed not needed one can get lower power by running at lower Vdd) • Low power (digital): • Low supply voltage: P ~ Vdd2 • Multiple low power libraries (Vdd, High Vt, ) • 250nm: 1, 130nm: ½ - ¼, 65nm: 1/8 - 1/16 • Many metal (cu) layers: • Power distribution, Signal distribution, columnbusses, etc. NMOS leakage S. Bonacini, P. Valerio CERN TWEPP2011
65nm CMOS technology • Mature technology (~10 years), available and very well known technology with good technology support (tools, libraries, IPs, MPW, production). Known as a strong technology node that will be available many more years (automotive, industrial , ,) • High yield, accurate simulation models , , , • Not an “exotic” 3D technology: Availability, Density, Yield , Design tools, • (Coarse TSV’s can be used at periphery if: available, appropriate, affordable and sufficient yield) • Analog • Good low noise and low power amplifiers can be made for pixels (small dynamic range, limited linearity) • Improved matching (at same transistor sizes) • Triple well deep implant and high resistivity substrate enabling isolation of critical analog parts from digital within pixels (have shown very good results in 130nm) • Details of this still to be better understood/tested for 65nm. • But we also get: • “High” NRE costs ( ~2 x 130nm for 4x higher density and ~½ power) • Lower supply voltage (but this is what gives low power) • Critical power distribution architecture (Local DC-DC and/or serial powering) • Higher gate and transistor leakage: Can be handled with appropriate design approaches • 12 inch wafers (bump-bonding and testing infrastructure), option of 8 inch ?. • Details of fine pitch bump bonding to be understood/tested (wafer size, UBM , , ,) • Pixel detectors are our IC technology drivers • 65nm technology access, MPW runs, tools, radiation qualification, etc. via CERN (and Europractice). • Or go below 65nm ? : • Cost, Radiation tolerance, ? Gain (power, functions) ?, Do we have time ?, Do we need it ? , ,
How could a 65nm pixel ASIC look like • ~50 um x ~50um Pixels (same area: 25 x 100, 35 x 70)or 50um x 100um ( 2x area) • How small pixels are advantageous in high rate LHC environment ?. • Resolution limited by multiple scattering in beam pipe and pixel detector itself. • Low power pixel ASIC critical ! • Small pixels: Consider binary readout (50um/√12 = ~14um, 35um /√12= ~10um) • “Large pixels”: Use of ADC/TOT for interpolation • Large pixels allow extended buffering and flexibility • Also feasible in a 50x50um pixel cell if appropriate architecture/technology ?. • When/how will pixel size be determined ?. • Major constraint for pixel ASIC design • ASIC design with moving targets is recipe for long design time and mistakes • Programmable pixel size: Dream ?, Function penalties, power penalties, Multiple applications • Pixel array size: > 20 x 20 mm (>200k pixels): • Limited by reticule size ( max 24 x 36 mm ), No stitching (yet) • ~2GHz hits/cm2 (~500MHz/cm2 track rate and cluster size = ~4) • 50 (100)kHz hit rate per pixel -> no fast shaping needed, low power • Digital corrections for analog imperfections ( e.g. , time walk, threshold variation) • Modes: Triggered/Trigger less, TOT/Binary, Testing modes, , • Trigger: Latency <25us (B-ID width: ~10bit) , Rate <1MHz (readout limited) • Contribution to trigger ? • “In-pixel” digital storage and processing (vital for high rate)
65nm pixel • Analog FE + adjust DAC’s: <~½ of 50x50um2 pixel • First FE/DAC prototypes have been demonstrated (LCD-CERN, ATLAS-LBNL) • ~4 bit TOT amplitude measurement • TOT with basic master clock (40MHz) • Dead time loss: 50khz * 16(max)*25ns = 2% (average ~0.5%) • Shorter dead-time using time interpolation in pixels requiredGet MIPS dead time < 50ns • Or low power SAR ADC (Turino) or Binary with no dead time. • Remaining ~½ for digital. Single pixel can contain the following digital: • 1/3: Flip-flops/registers (1.8 x 3.8um) = ~45 • 1/3 logic: NAND4 (1.8 x 1.4um) = ~125, • 1/3 SRAM (1.05 x 0.5um) = ~600 (in practice much less for small memory) (assuming using small size standard cell library and 75% area utilization) (forgetting about area penalty to separate analog and digital) Marginal to implement buffering, logic, multiple modes, SEU protection, etc. Local Pixel Regions (PR, or super pixels or , ,) to optimize local clustering and enable efficient local digital processing, storage and readout. Preamplifier P. Valerio CERN/LCD
65nm pixel • Hit storage in pixels / pixel regions • High density column data “busses” to get data out of pixel array. • End of column: Date merging, checking, formatting, compression, Readout. • High speed low power serial link(s) for readout • 4-10 pixel chips feeding one (LP)GBT with local 320 - 640Mbits/s link(or on-chip Gbits/s serializer) • Opto part located ~1m from pixel detector as opto components can most likely not survive in extremely harsh radiation environment. • Full SEU protection of critical parts • Clear identification of critical and non critical parts vital • Power density: <1 W/cm2 • Major design goal determining material for services (power, cooling) • Power conversion on-chip ?, Serial / DC-DC ? • Design complexity will come from low power complex digital with SEUs interleaved with low power, low noise analog • Potential pixel trigger will have a significant impact on the digital architecture. • Required buffering and logic can likely fit in a 65nm chip • System aspects critical (Bandwidth, Links, Architecture, protocol, ,)
Pixel region optimization • Grouping pixels in regions is critical to have enough local digital area for efficient buffering, logic, routing, low power features, different operating modes, , , • HEP pixel hits are naturally clustered • Cluster sizes: • Middle of barrel: ~Square clusters, 1 – max 9 (3x3), Average ~4 (just an example) • End of barrel: Very elongated clusters from track angle Strong dependence on pixel size, sensor, sensor thickness, radiation damage, track angle, detector angle, Lorentz angle from magnetic field, , . • Local clustering enables data reduction (“compression”) • What is “optimum” pixel region organization: • Pixel regions hit per cluster • Static / dynamic pixel regions • Required buffers per pixel/pixel region • Buffer size per pixel region and per pixel • Memory bits required and optimization • Data organization and readout bandwidth Simple statistical “model” to get basic “feeling” Architecture / hits simulation for detailed optimization
Middle barrel • Initial cluster assumption: Central hit with max +1 periphery • Cluster size Distribution Average = 4.2 (just one example) • Pixel Regions (PR) hit per track: • PR = 1 x 1: 4.2 • PR = 2 x 2: 2.6 (ATLAS FEI4) • PR = 2 x 4: 2.2 • PR = 3 x 3: 2.1 • PR = 2 x 8: 2.0 • PR = 4 x 4: 1.8 • PR = 5 x 5: 1.6 • PR = 4 x 8: 1.6 • PR = 8 x 8: 1.4
End of barrel M. Swartz, “A Detailed Simulation of the CMS Pixel Sensor,” CMS Note 2002/027 • Very elongated clusters • Simplified 1-dimensional clusters: 1 – 8 hits • Comparison for different cluster size distributions 100 x 150um2 Approximative formula: n = PR width h = number of hits in cluster
Pixel region and cluster shapes • Cluster shapes: • Square: 3x3 • Elongated: 4x3 • Long linear: 8x1 • Region shapes: • 1x1, 2x2, 4x4, 2x8, 8x2 • Distributions • Single=1, Avg=4.4, Avg=6.6, Max
PRs hit per cluster • For square pixel regions no major variation with cluster shape (when having same number of average hits/cluster) • If non square pixel region the shape must (obviously) follow the shape of the cluster • May not be the case in practice
PRs hit per cluster • Number of pixel regions hit per cluster • Significant reduction from 1x1 to 2x2 • Limited reduction from 2x2 to 4x4 • Above 4x4 ~no reduction
Memory required for latency buffering • Track rate: 500MHz/cm2 (Hit rate: 2GHz/cm2), Latency: 6.4us, Buffer overflow < 10-4 • Having large buffers shared by multiple pixels is much more efficient than independent small dedicated buffers. • But each buffer must be wider: Store date for multiple pixels 4 buffersrequired 19 buffersrequired instead of 4x(8x8)=256!
Relative number of buffers • Reduction in number of buffers is a basic statistical property that does not depend on cluster shape. • Shared buffers more efficient • Plateau reached at 8 (2x4) – 16 (4x4) pixels per pixel region
Memory bits • Less buffers required when pixel region gets bigger • BUT, more memory bits needed per buffer. • 10 bit bunch ID (shared between pixels in pixel region) • 4 bit ADC/TOT per pixel (unless using a dynamic allocation of memory “words”) • Pixel regions should not be larger than ~size of clusters • Optimum reached between 2x2 – 4x4 • Non square pixel region: Pixel region shape must (obviously) be in same direction as cluster
Pixel region optimization • Sufficient logic area to implement buffering and required functions efficiently • Minimize required number of data buffers • Minimize required number of memory bits • Minimize power consumption • Mixed signal layout considerations Not sacrificing performance • Minimize dead-time • Minimize data volume to read out • Not optimize strongly for a certain sensor type and specific application/location • Cluster size and shape • Hit rate • Sensor type and configuration of pixel detector may change several times during the R&D for a complex pixel ASIC.
Pixel region optimization • 2x2 - 4x4 pixel region optimal to minimize buffer memory bits • Pixel Regions of 4 x 4 allows more/better sharing across pixels. • Get sufficient digital resources in each pixel region • ~½ for analog and required adjustment DAC’s • ~½ for synthesized digital (75% utilization): 1/3 registers (~720), 1/3 logic (~2000 NAN4), 1/3 SRAM (<10Kbit)~100k transistors per pixel regionSophisticated local pixel processing/storage !. • Reduced buffering required • Reduced static and dynamic power ?. • Architecture should not be too highly optimized for a particular detector configuration, sensor, location, clustering, etc. • Use technology to build a pixel architecture that is flexible, high rate performance and low power. • Initial assumptions may (will) not be correct. How to organize a mixed signal floor plan for 4x4 ?
Pixel Region 4 x 4: A • Maximize effective area for digital • Use of automated synthesis and P&R • Analog low noise islands • Shielded minimum length low capacitance connections to bump pads. • Analog power distribution • Common biases distribution • Minimize cross talk from digital • Substrate isolation with deep implant • Surround analog islands with quiet logic (configuration, etc.) • Quite logic and shielding below analog distribution signals: Power, biases, global threshold, etc. • Organize digital: • Pixel hit processing and merging • Buffering • Column bus interface • Critical shielding of digital noise to sensor, bump pad and line to bump bad. • Global routing optimization: Analog Power, Analog biases, digital power, timing control, configuration, readout, etc. ,
Pixel region 4 x 4: B • Analog “stripes” • Good for distribution of analog references and analog power • Minimize crosstalk • Digital “stripes” • Good for distribution of timing signals, column busses and digital power. • One regular digital zone for P&R digital • Routing from bump-pad could be problematic • Similar approach considered for LHCb VELO • High rate, trigger less
General 4 x 4 architecture Power Rows: 128 PR’s = 512pixels DAC • Pixels: 4 x 4 x ~128 x ~128 = ~256k (262144) • Chip size = ~50um x 4 x 128 = ~2.6cm x ~3cm (Yield maximization required) • Obviously resembles LHCb/ALICE, FEI4, LHCbVelopix and other high rate pixels • And any other data driven (HEP) chip/system: System on a chip Config DAC Config DAC Region proc. B-ID tag Config Hit Proc. TOT TW comp. Etc. Config. int DAC B-ID Monitoring PR: 4 x 4 Trigger match Control Col.Bus Int. Readout Interface EOC Con. Columns: 128 PR’s = 512pixels
Basic CMS HL-LHC assumptions • Average cluster size: ~4 (worst case ?) • Rate: Worst case HL-LHC (layer locations as in Phase1) • Layer 1 (3.0cm): ~500MHz/cm2 tracks -> ~2GHz/cm2 hits • Layer2 (6.8cm): ~½ of layer 1 • Layer3 (10.2cm): ~½ of layer 2 -> ~¼ of layer 1 • Layer4 (16.0cm) : ~½ of layer 3 -> ~1/10 of layer 1(50MHz/cm2 tracks, 200MHz/cm2 hits) • End-caps ? • Pixel chip: ~6cm2 • Pixel size: ~50x50um2 = 2500um2 (or 25um x 100um) • Or 50um x 100um if better optimized for HL-LHC • Pixel regions: 4 x 4 • Pixels per chip: ~256k (262144) • Tracks/hits per chip per Bx: • Layer 1: 75 tracks per Bx (300hits/Bx) • Layer 4: 7.5 tracks per Bx (30hits/Bx) • L1 Trigger rate: 100kHz (200kHz) • Option of 1MHz ? (see later)
Readout (assuming 4x4 PR) • (LP)GBT user bandwidth: 3.2 Gbits/s (6.4Gbits/s a possible future option) • 10 E-links @ 320Mbits/s • 20 E-links @ 160Mbits/s • (40 E-links @ 80Mbits/s) • (10 E-links @ 640Mbits/s option for future LPGBT) • Event header: 32 bit event header (B-ID , E-ID, ?) • PR Full TOT: 16x4b TOT, 14b PR address • Layer 1: ~1060Mbits/s, 2-4 pixel chips per LPGBT • Layer 4: ~109Mbits/s, ~20 pixel chips per LPGBT • PR Variable TOT per PR: hits x (4bit pixel ID + 4bit TOT), 14b PR address(Data reformatting/compression done in EOC) • Layer 1: ~450Mbits/s, 4-8 pixel chips per LPGBT • Layer 4: ~48Mbits/s, >20 pixel chips per LPGBT • Layer 2,3,4: Pixel module with 2 x 4 pixel ASIC’s and a LPGBT • 1 E-link at 320Mbits/s per pixel ASIC • 2nd. E-link per pixel ASIC for redundancy • Layer 1: Pixel module with 1 x 4 ( Narrow module required for inner layer) • 2 E-links at 320Mbits (or one 640Mbits/s) per pixel ASIC
Sync Pixel trigger data for L0 • Fast coarse “strip” information • Fast OR along pixel region (or pixel) columns • Tracks per bunch crossing: • Layer1: 75 per chip ! • Layer4: 8 per chip • Clustered “strips” along pixel columns • Pixel region columns: 128b hit map per chip 128b x 40MHz = 5Gbits/s per chip • Useful if ~50% bits set in layer1 ? • Compression not advantageous (layer4: 8 out of 128. 8hits x 7b encoded = 56b) • Pixel columns: 512b hit map512b x 40MHz = 20Gbits/s per chip !. • Double pixel layer with Pt cut seems difficult • Feasible to put logic within pixel chip but very hard to get data out and make a viable system
Pixel trigger with ROI • L0 trigger from calo and muon within ~3us (or more). • L0 rate: 1MHz ? (assuming x10 higher rate than now) • ROI percentage: 10 % ? At ASIC level (just a guess)ROI rate: 1MHz x 0.1: 100KHz per pixel ASIC. • L1 latency: <100us (same buffer depth for L0 and L1) • ROI data: Single pixel address per track: 9 + 9 bit. • Local clustering without considering TOT. • Local logic within pixel array. • Pixel ROI data: Relative low data rate per pixel ASIC • Layer 1: 1MHz x 0.1 x (75 tracks x (9+9) + 16) = ~140Mbits/s • Layer 4: 1MHz x 0.1 x (75/10 tracks x (9+9) + 16) = ~15Mbits/s • Buffering • L0 in pixel cell • L1 in pixel cell (or EOC?) • Shared (or separate ?) buffers • Colum data busses: Shared or separate between ROI and readout ? • Readout: Shared or separated ?. • Separated: Allocate bandwidth as required by two paths • 4 links with programmable speed ( 80/160/320 Mbits/s) and programmable use (L0 ROI data or L1 readout data)
Option: trigger 1MHZ, ? x 10 us • Globally different readout/trigger option for whole experiment • All front-end electronics to be changed for all detectors.This is a serious upgrade !. • Pixel estimates • PR region buffering: Scaling with latency • ~100bits required per pixel per 10us • Limit with 65nm assuming 1/3 of digital pixel region area occupied by memories (without TRM): 10Kbits/16/100 x 10us = ~60us (possibly 100us)(assumed 50um x 50um pixels) • Readout bandwidth: Scaling with trigger rate • Layer1: ~4.5Gbits/s per pixel ASIC -> 1LPGBT per ASICPut serializer on pixel ASIC. • Layer4: ~480Mbits/s per pixel ASIC -> 1LPGBT per 4-8 ASIC • Track rate (500MHz/cm2), Cluster size (4): Reduction ?Defined very conservatively in this study. • Data reduction/compression 50% ? (TOT bits, Header, Better on-chip clustering, On-chip cluster center interpolation, ?) • More intelligent EOC logic
Flexibilities • Sensor, pixel size, signal, polarity, leakage, , • 25ns or 50ns collision spacing. • Triggered and non triggered modes • Digital corrections (timewalk, etc.) • Latency: 10 - 100us • Trigger rate: 100k or 1M • ROI mode • Test and calibration features • Scalable for different readout data rates • SEU and fault tolerance • Etc. • How many of these can we keep for how long ? • At least until specs ~clear
Optional pixel sizes with same chip 50um x 50um ASIC • ~50x50um pixel area seems feasible in 65nm • Compatible with available staggered bump-bonding • (55 x 55 to be compatible with medi/timepix sensors ?) • Flexible pixel region configuration can support multiple sensor configurations • 50 x 50 (100%), 25 x 100 (100%) • 50 x 100u (50%) • 100 x 100 (25%), 50 x 200 (25%), 100 x 200 (12.5%) • Effective pixel capacitance dominated by bump bonding, pad, shielding, re-routing and input transistor capacitance. • No major effect on pre-amp noise • Pre-amp current can be increased for larger pixels • Unused channels obviously powered off. • Will inevitably not be the best possible optimization for detectors using only a part of the available channels/resources. • Power • Unused silicon area • Bump-bonding pattern • Considering the cost ( ~2M) and efforts ( ~20 MY) required to develop/test/qualify /produce such a pixel ASIC this can be an acceptable compromise. Total budget: ~5M incl. manpower • Can some “lost” silicon area be recovered ?: • Analog: Difficult • Power off unused analog channels • Multiple thresholds per pixel ? (no gain when having TOT) • Can one parallel pre-amps to get lower noise for high capacitance? • Digital: • Power off parts of digital ? • Buffering: Increased depth as full width not required ? • Increase TOT dynamic range when using less pixel channels 25 x 100 sensor 50x 50 sensor 50 x 100 sensor 100 x 100 sensor 50 x 200 sensor 200 x 200 sensor
Work to be done for digital part • Architecture • High level System Verilog model for architecture optimization • Pixel regions • Latency buffer architecture • ROI extraction • Triggered / non triggered / other modes • Column bus bandwidth/structure • Etc. • Stimulatinghit patterns: • Constrained random hits (within system Verilog), • Full Monte Carlo simulations, • Scaled hits from current detectors, • Provoking cases • To be used for design verification framework for implementation • Detailed Implementation: Pixel region logic • Logic and circuit level • Area and power optimization • Logic type: • Static , dynamic • Synchronous , gated clocks, asynchronous • Buffers: Latch, RAM • Column busses • Clock distribution (significant part of power consumption) • SEU/yield optimization • Etc. • EOC: Readout, Control
Time schedule • Lots of work to be done to have chip ready for ~2020 upgrade • Final chip must be produced/tested: ~2018 • Final chip prototype working: ~2017 • Full first prototype: ~2016 • Small scale prototype: ~2015 • Basic test structures: ~2014 • Radiation qualification of technology: ~2014 • Pixel chip for sensor development/test ? • Dedicated 65nm chip ? • LHCb vertex pixel chip (55x55um2) • Timepix3 (55x55um2) • Other ?
Summary • 65nm seems very promising for high rate flexible pixel readout chip. • Pixel regions minimizes required digital resources and optimizes mixed signal floor-plan • Keep flexibilities/modes as long as possible. • Important global/local optimizations to be made: • Digital: Architecture (e.g. PR), logic, clocking, circuit, etc. • Analog: Preamp, disc, etc. • System: Latency, trigger rate, ROI, readout, etc. • 2020 is not so far away !