

### Advanced Implementation Techniques For Achieving Best PPA in High Performance Designs

Sudhakar Vadiraj Alur, Kishorkumar Chavada, Ramnarayan Thukkaram, Ujjwal Prakash

Qualcomm India Private Limited



### Agenda

- Introduction
- Implementation challenges
- Synthesis Flow Enhancements
- Synthesis Strategy
- PnR Challenges and Solutions
- Results
- Conclusion

### Introduction

- Goal:
  - Address congestion with a reduced metal layer stack.
  - Converge on PPA targets.
  - Meet strict BBOX area requirements.
- Design details:
  - Subsystem with millions of instances.
  - Frequency in the GHz range.
  - Multiple power domains.
  - Multi-scenario closure.



### **Implementation Challenges**





Reduced metal layer stack challenges:

- Area: stringent target to meet the reduced metal layer stack without area overhead.
- Performance: aggressive FMAX push to provide best-in-class HPC within the given constraints.
- Power: dynamic and leakage power targets.

### Synthesis Flow Enhancements









### Synthesis Flow Enhancements (Contd.)

- set\_technology -node
- set\_stage -step synthesis high\_effort\_congestion
- set\_qor\_strategy -metric timing -stage synthesis -mode balanced (mega switch)
- create\_placement -incr and legalize after the first initial\_opto compile
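
The last two bullets can be read as a short script fragment. This is a minimal sketch, not the authors' script: it assumes the compile is run with `compile_fusion`, that `-incr` abbreviates `-incremental`, and that "legalize" refers to `legalize_placement`. `set_technology` and `set_stage` are left out because the slide does not give their full arguments.

```tcl
# Sketch only: command order inferred from the bullets above.
set_qor_strategy -metric timing -stage synthesis -mode balanced

# Assumption: the "first initial_opto compile" is a compile_fusion run
# that stops after the initial_opto stage.
compile_fusion -to initial_opto

# Incremental placement followed by legalization, as named on the slide.
create_placement -incremental
legalize_placement
```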



### Synthesis Strategy

- Direct Congestion Driven Placement (DCDP): optimizes congestion directly as part of the placer.
- Enhanced Low Power Placement (ELPP): reduces switching power during placement.
- Destructive Advanced Logic Restructuring (DALR): restructures logic across hierarchies to improve QoR.
- Enhanced Delay Optimization (EDO): better delay optimization on path groups.
- Cell filtering: filters high-pin-count cells at initial map to avoid congestion.
- Area improvement: improves area optimization.



### Synthesis Results



- Timing and congestion improved compared to the traditional flow.
- Area impact was minimal.
- Total power was traded off against higher frequency and routing challenges.




### Synth Netlist Handover to PD Team





| Baseline                           | Place<br>Congestion | Route DRC/<br>Short    | Post-Route<br>DRC/Short |
|------------------------------------|---------------------|------------------------|-------------------------|
| More ML version and<br>recipe      | Ref                 | Ref                    | Ref                     |
| Intermediate version<br>and recipe | ↑5%                 | ~10%/~80%              | ~44%/~96%               |
| Final version and recipe           | ↓4%                 | ~83%/~90%              | ~96%/~99%               |






#### **Congestion/Area Improvement Solutions**






#### Performance

- GR-DR miscorrelation
  - SI impact
- Lower layer routing
- CLK NDRs
- CCD offset trials
  - Early MSCTS
- Hold BW regions
- DCD degradation
- Setup-hold critical paths
- Latency improvements



|           | Post-CTS<br>Non-SI | Route<br>Non-SI | Post-CTS<br>SI | Route<br>SI |
|-----------|--------------------|-----------------|----------------|-------------|
| Setup TNS | Ref                | minor change    | Ref            | 20X ↑       |
| Hold TNS  | Ref                | minor change    | Ref            | 3X ↑        |



#### **Performance Improvement Solutions**

- For bridging the GR-DR miscorrelation
  - SNPS provided MLGR models for reduced metal stack
  - Via derates to mimic the DR fallout
- Hold BWs
  - DCDP across stages
  - HAB flow
  - TBC for hold buffer count reduction
- CCD refinements
  - Staggered CCD approach
  - 900% ↓ in TNS for these PGs
  - 89% ↓ in FEP







| Baseline       |         |         |           |           |        |
|----------------|---------|---------|-----------|-----------|--------|
| Default Recipe | R2R TNS | R2R NVE | R2R H TNS | R2R H NVE | Util % |
| Post-CTS       | Ref1    | Ref2    | Ref3      | Ref4      | Ref5   |
| Route          | 65X     | 26X     | 16X       | 8X        | +1%    |
| Post-Route     | 0.6X    | 0.9X    | 0.3X      | 0.6X      | +5%    |
| Intermediate   |         |         |           |           |        |
| Default Recipe | R2R TNS | R2R NVE | R2R H TNS | R2R H NVE | Util % |
| Post-CTS       | Ref1    | Ref2    | Ref3      | Ref4      | Ref5   |
| Route          | 22X     | 11X     | 12X       | 5X        | +0.5%  |
| Post-Route     | 0.25X   | 0.6X    | 0.25X     | 0.37X     | +3%    |
| Final          |         |         |           |           |        |
| Default Recipe | R2R TNS | R2R NVE | R2R H TNS | R2R H NVE | Util % |
| Post-CTS       | Ref1    | Ref2    | Ref3      | Ref4      | Ref5   |
| Route          | 10X     | 5X      | 1X        | 22X       | 0%     |
| Post-Route     | 0.15X   | 0.30X   | 0.7X      | 0.7X      | +1%    |

#### **Performance Improvement Solutions (Contd.)**

- DCD improvements
  - cts.common.prefer\_inverter\_for\_delay\_insertion true
  - cts.compile.repeater\_selection inverters\_only
- Setup-hold critical paths
  - Two-pass clock\_opt; SNPS TBC
  - Manual ECO for leftover paths.
- Latency improvement
  - Flex H-tree based MSCTS
  - split\_clock\_cells -latency\_driven -cells [get\_cells a1]
  - 10% ↓ in latency and 15% ↓ in skew
  - 40% improvement in hold TNS.
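
The DCD and latency settings above can be written as a Fusion Compiler Tcl fragment. This is a minimal sketch: the option and command names are copied from the bullets, the `set_app_options -name ... -value ...` wrapper is an assumption about how they were applied, and `a1` is the placeholder cell name used on the slide. Availability of these options may depend on the tool version.

```tcl
# DCD: prefer inverters when CTS inserts delay or selects repeaters
# (option names taken verbatim from the slide).
set_app_options -name cts.common.prefer_inverter_for_delay_insertion -value true
set_app_options -name cts.compile.repeater_selection -value inverters_only

# Latency: latency-driven splitting of clock cells.
# "a1" is the example cell from the slide, not a real instance name.
split_clock_cells -latency_driven -cells [get_cells a1]
```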



#### **Power**

- Synth -> Post-Route
  - ~15% increase in logic dynamic power
  - ~44% increase in leakage power
- Power-critical cores are multi-instantiated.
- The reduced layer stack drives up congestion and timing violations; in turn, the tool upsizes cells and inserts excessive buffers to fix them.
- GR-DR miscorrelation
- High SI impact from GR to DR




|                | Ref Synth to<br>PRO growth | Initial Synth to<br>PRO growth | Final Synth to<br>PRO growth |
|----------------|----------------------------|--------------------------------|------------------------------|
| Register Power | -5%                        | -3%                            | -5%                          |
| Combo Power    | -6%                        | 10%                            | 3%                           |
| Logic Power    | -1%                        | 15%                            | 7%                           |


#### **Power Improvement Solutions**

- Synth stage:
  - EIO/EDO
- Place stage:
  - InDesign PrimePower
    - Generate a refreshed SAIF from the input FSDBs
    - Read the generated SAIF back into FC for accurate switching activity
  - Two-pass final place and final opto
- PCO stage:
  - Two-pass clock\_opt
- Post Route stage:
  - Lower DCVS FMAX Relaxation





#### **Power Improvement Results**



| Recipe                      | % Dynamic<br>Improvement | Std cell<br>area | DRC                | Leakage |
|-----------------------------|--------------------------|------------------|--------------------|---------|
| EIO/EDO                     | 0.60%                    | Same             | NA                 | 4%      |
| IDPP                        | 1.50%                    | Same             | NA                 | -6.50%  |
| Two pass final Place & opto | 0.02%                    | -0.28%           | Same               | -2.30%  |
| 2Pass Clock_opt             | 1.10%                    | -0.12%           | Same               | -2.80%  |
| FMAX Reduction at PRO       | 2.60%                    | -3%              | -415%              | Same    |
| Total                       | 7%                       |                  |                    |         |

### **Final Results**



- Timing (TNS) converged in the enhanced flow, whereas the traditional flow did not complete in the post-route tool.
- Std cell area is better in the enhanced flow.
- The design was not routable with the traditional flow but was routable with the enhanced flow.





### Conclusion

- Achieved the increased target frequency with reduced metal layers.
- With the congestion improvements, the database was routable.
- This flow is now the default for all high-performance cores targeting higher frequency.






#### Congestion/Area

#### **Place**

- Hold-aware budgeting with selective hold-critical scenarios
- set\_qor\_strategy -metric total\_power -mode balanced
- AWP CCS
  - set\_app\_options -name time.enable\_ccs\_rcv\_cap -value true
  - set\_app\_options -name time.delay\_calc\_waveform\_analysis\_mode -value full\_design
- Custom values for place.coarse.experimental\_delay\_weight

- DCDP

- Careful selection of the DTDP scenario after many trials
- Ultra congestion effort
- ADC true and initial\_place options
- Custom values for place.coarse.congestion\_driven\_max\_util
- Custom values for place.coarse.target\_routing\_density
- place\_opt CCD
- place\_opt.initial\_drc.global\_route\_based (for notch congestion)
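
The AWP CCS options listed under Place, written out as they might appear in a Fusion Compiler script. This is a minimal sketch using only the names and values given on the slide. The custom values for the `place.coarse.*` options are not stated in the deck and are not guessed here.

```tcl
# AWP CCS settings: option names and values as given on the slide.
set_app_options -name time.enable_ccs_rcv_cap -value true
set_app_options -name time.delay_calc_waveform_analysis_mode -value full_design
```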




#### **Post-CTS onwards**

- New MLGR model + via derates on all corners + hold\_inst TBC
- set\_qor\_strategy -metric timing
- set\_app\_options -name opt.common.power\_flow -value preserve\_area