

# Clock-in early: Integrated Structural Multisource Clock Tree Synthesis to push PPA in high-speed GPU designs

Presenter: Amit Dounde

Co-authors: Francisco Rivera Valverde, Huy Cao,

Milind Mahajan, Dhananjay Kewale

Intel

### Agenda









What/Why/How?



Early clock-in features



Results & Recommendations

# Common Acronyms



| Abbreviation | Definition                 |
|--------------|----------------------------|
| CTS          | Clock Tree Synthesis       |
| MSCTS        | Multisource CTS Flow       |
| SMSCTS       | Structural Multisource CTS |
| RMSCTS       | Regular Multisource CTS    |
| ISMSCTS      | Integrated SMSCTS          |
| IRMSCTS      | Integrated RMSCTS          |

SNUG INDIA 2024





# Multisource Clock Tree Synthesis



- On-chip-variation (OCV) motivates advanced clock techniques to achieve lower skew, improved shared clock paths and other PPA targets.
- MSCTS consists of two components:
  - Global clock distribution: clock mesh, spine, ...
  - Local clock distribution: subtree's that connect the global network to the sinks
- We focus on local clock distribution techniques!



## Local Clock Tree Synthesis



 There are 2 main mechanisms in which local clock trees are created in MSCTS:

#### **Structural MSCTS (SMSCTS)**



#### Regular MSCTS (RMSCTS)





#### Structural MSCTS

snug

- Optimize for latency and skew, while minimizing clock power
- Consists of:
  - Balance: add cells to level subtrees
  - Merge: merges equivalent clock cells
  - Optimize: splitting, sizing, relocation of clock cells
  - Route
  - Refine: sizing and cell relocation
- Typically, less OCV, better skew but slightly more demanding flow/power





## Regular MSCTS



- Sink distributed among tap drivers during tap assignment
- Regular CTS is used for building the subtrees
  - No restriction on adding/removing buffers
- Lesser power than SMSCTS, very useful skew friendly



### A better way to do Structural MSCTS



- RMSCTS "simplicity" gives the tool a lot of flexibility to:
  - Enable compile\_fusion clock-aware optimizations better clock gate optimization, concurrent clock/data
     (CCD) opt => Integrated Regular MSCTS
  - Easier useful skew implementation as there is no restriction on adding/removing buffers on subtrees



 Traditionally, SMSCTS clock trees are frozen during build clock – CCD mainly used for sizing/relocation of clock cells

Push/pull can be done on a post-CTS optimized database also, but limited room for CCD as compared

to full flow enabling



#### **Presentation Goal**

 Enable early SMSCTS clock construction during compile\_fusion to take advantage of clock-aware optimization features











# What is Integrated SMSCTS? (Traditional)



**Traditional** 



- Clock cell insertion (balancing, splitting, DRC fixing, ...) happens until clock\_opt
- After new cell insertion, clock network is routed
- Refine optimization are performed, with post clock data optimization as well (clock\_opt –from final\_opto)

### What is Integrated SMSCTS?



**Traditional** 



**Integrated** 



- Moved clock cell creation (balancing, merging, splitting,...) to final\_place.
- CTS performs incremental skew optimization, routing and post route opt
- Clocks are not propagated, but cell presence enables clock-aware optimization



### Why we do ISMSCTS?



#### Goals

- 1. (**ISMSCTS**) Leverage the latest fusion\_compiler enhancements that enable compile fusion to create the SMSCTS clock subtree structure logically
- 2. (CCD) Enable useful skew computation leveraging the already inserted clock cells
- 3. (ECGL) Enable estimate clock gate latency to have a better optimization perspective on enable paths



- Cts network inserted upfront great for clock cell prioritization (placement)
  - Clock delay not propagated, expected in future releases

Full flow CCD has more opportunity to optimize



Realistic clock gate picture enal





## **Traditional** post compile SMSCTS



Design Init (elab, analyze, constrain,...)

compile\_fusion -to initial\_opto

compile\_fusion -from final\_place

source cts\_settings.tcl

clock\_opt -from build\_clock -to route\_clock

clock\_opt -from final\_opt

route\_opt

- SMSCTS settings make sense only on a post compile/place\_opt context
- Clock network is constructed only after compile\_fusion/place\_opt is done (build\_clock).
- Typically, CCD is enabled only on a post-CTS fashion (from final\_opt)

# Feature#1: Integrated SMSCTS (Precompile) snug



Design Init (elab, analyze, constrain,...)

source cts\_settings.tcl

compile\_fusion -to initial\_opt

compile\_fusion -from final\_place

- Provide CTS settings before compile\_fusion final\_place
  - Exceptions, balance points, clock cells to use
  - set\_multisource\_clock\_subtree\_options
  - set\_app\_options -name compile.flow.enable\_multisource\_clock\_trees value true
- Make sure no user don't touch on clock network
- Compile\_fusion will trigger Trial CTS where clock cells are physically inserted in the DB and placed
  - No routing happens on subtree at this stage\*\*

# Feature#1: Integrated SMSCTS (Trial CTS)



- New Trial CTS stage will be executed during compile fusion final\_place
  - Watch for CTS-\* messages which traditionally would have been in clock\_opt command

compile\_fusion -from final\_place

```
Information: Starting compile_fusion / final_place (FLW-8000)
Information: Starting compile_fusion / final_place / Trial CTS (1) (FLW-8000)
Compile-fusion optimization trialctsups Item 1 123456.00 7891011.00 12131415.16 17181920.25
Information: Ending compile_fusion / final_place / Trial CTS (1) (FLW-8001)
Information: Starting compile_fusion / final_place / Optimization (FLW-8000)
Compile-fusion command begin CPU: 3123 s ( 123 hr ) ELAPSE: 123456 s ( 78.910 hr ) MEM-PEAK: 1 MB
Compile-fusion timing update complete CPU: 3123 s ( 123 hr ) ELAPSE: 123456 s ( 78.910 hr ) MEM-PEAK: 1 MB
Information: Running multisource subtree synthesis for preprocess and merge. (CTS-781)
Information: CTS will work on the following scenarios. (CTS-101)
```

### Feature#1: ISMSCTS (Postcompile)



 Keep SMSCTS settings same as the ones used in pre-compile

- Clock\_opt will understand the already created logical clock network
  - Will only perform incremental optimizations
  - Route and refine clock network
  - This typically allows a better skew as compared to non Integrated flow

 Route-aware optimization on clock network happens on this stage

clock\_opt -from build\_clock -to route\_clock

clock\_opt -from final\_opt

route\_opt

### Feature#2: Full Flow CCD (Enable)



- CCD can be enabled via app options:
  - set\_app\_options -name compile.flow.enable\_ccd -value true
  - set\_app\_options -name clock\_opt.flow.enable\_ccd -value true
  - set\_app\_options -name route\_opt.flow.enable\_ccd -value true



# Feature#2: Full Flow CCD (level control)





• SMSCTS tree target level (set\_multisource\_clock\_subtree\_options -target\_level X) can be different than **CCD target levels** (set\_clock\_tree\_options -max\_levels Y), where Y > X.



Because ISMSCTS preserve subtree level structure, special handshaking is required from the tool perspective to physically implement offsets SNUG INDIA 2024

24

# Feature#2: Full Flow CCD (handshake)



- In ISMSCTS context, following options are required so compile fusion is aware of the subtrees while computing offsets
  - set\_app\_options -name cts.multisource.enable\_subtree\_synthesis\_aware\_ccd -value true
  - set\_app\_options -name\_clock\_opt.flow.enable\_multisource\_clock\_trees -value true

- The specified offsets must be realistically implementable during clock\_opt build tree:
  - set\_app\_options -name ccd.max\_prepone -value <>
  - set\_app\_options -name ccd.max\_postpone -value <>

# Feature#3: Native clock splitting



- With ISMSCTS, it is advisable to turn on native CG-enable logic duplication feature:
  - set\_app\_options -name cts.multisource.enhanced\_enable\_logic\_splitting -value true
  - set\_multisource\_clock\_subtree\_options -split\_enable\_max\_width <> split\_enable\_max\_height <> -split\_enable\_max\_latch\_level <> -split\_enable\_max\_fanout <>
- With realistic clock splitting, it is advisable to enable ECGL feature to help enable path closure on clock gates
  - set\_app\_options -name opt.common.estimate\_clock\_gate\_latency -value true





#### Context of the blocks used



 Used Fusion Compiler<sup>™</sup> tool version U-2022, highly tuned flow recipe for best PPA

#### Blocks details:

- ~1 million instances blocks
- highly congested blocks (both wire and placement)
- 70% > utilization
- Graphics designs with many datapaths, and muxing
- Non-macro dominated

## **Experiment Setup**



- Data is always presented after route\_opt, with fill-insertion, and apples-to-apples comparison
- **Baseline** for all experiments is a traditional SMSCTS flow, with CCD size only enabled in post cts.
- Target experiments use the best recommendations from before: ECGL, CCD since final\_place, ISMSCTS
- We also provide equivalent IRMSCTS results



### Highlights of the feature: TNS





- ISMSCTS+features provide a significant improvement as compared to baseline
  - Some blocks are still better with IRMSCTS
- Min convergence in general is more controlled in Structural flow impacting overall optimization
- TNS reduction can be mapped to ease of convergence:
  - Less congested/buffer polluted designs at route\_opt
  - Lesser violations means lesser effort for PR/manual work

### Highlights of the feature: WNS





- Early CCD has been a great enabler of frequency push:
  - For non-RTL bottle-necks, WNS can be reduced with this technique
  - Late CCD (both sizing or incremental post-CTS push/pull) has restricted opportunities to push freq-max

## Room for improvement



- With CCD, min is typically degraded (up to 20% degradation on ISMSCTS and 40% on IRMSCTS)
  - CCD hold effort high might be needed for high hold degradation. This may have an impact on leakage power.
- With CCD, ECGL and ISMSCTS, runtime can increase up to 10%
- ISMSCTS had ISO power/area as compared to baseline
  - IRMSCTS had slightly better total power (<5%) than ISMSCTS</li>
- We observe maximum 2 extra levels added by CCD on ISMSCTS (5% more clock buffers than baseline) and ISO clock gates

#### Conclusions and future work



- ISMSCTS flow is the recommended approach for designers to converge their block
  - For all high frequency blocks, ISMSCTS gives better PPA compared to IRMSCTS or SMSCTS.
  - Some blocks (mostly lower frequency) may converge better with regular MSCTS as compared to structural MSCTS with CCD.
- How to improve from here
  - ISMSCTS happens in final\_place discussion in progress to analyze Trial CTS on initial\_opto
  - Clock propagation on compile\_fusion could potentially improve total setup cost









## Concurrent clock and data Opto









- This feature can improve design power along with timing QoR improvements
- Compile\_fusion can estimate offsets saving area/power
- Clock build\_clock push/pull clocks for better timing
- Clock final\_opto/route\_opt perform incremental CCD optimizations

Source: Concurrent Clock and Data, SNUG 2018

#### **ECGL**



At pre-CTS stages we don't have correct clock gating latencies:



 Tool can estimate ICG downstream clock delay after specifying set\_multisource\_clock\_subtree\_options and ensuring opt.common.estimate\_clock\_gate\_latency is true (OBD)



# Feature#2: Full Flow CCD (Recommendation)



- Many CCD app options to control flow are available
  - report\_app\_options \*ccd\*
- For a successful CCD implementation, consider:
  - If setup critical design: set\_app\_options -name ccd.fmax\_optimization\_effort -value high
  - CCD is path group oriented, make sure both ccd.skip\_path\_groups and ccd.targeted\_ccd\_path\_groups are properly set
  - If CCD hold degradation is not acceptable, consider changing value of ccd.hold\_control\_effort

# Feature#2: Full Flow CCD (Hierarchical)



- Hierarchical convergence can be taxing with CCD enabled.
- FC supports different boundary CCD options:
  - No ccd on boundary (ccd.optimize\_boundary\_timing)
  - Skew targets for boundary paths (recommended value is global skew pre-ccd)
    - set\_app\_options -name ccd.skew\_opt\_input\_boundary\_max\_prepone -<>
    - set\_app\_options -name ccd.skew\_opt\_input\_boundary\_max\_postpone -value <>
    - set\_app\_options –name ccd.skew\_opt\_output\_boundary\_max\_prepone -value <>
    - set\_app\_options –name ccd.skew\_opt\_output\_boundary\_max\_postpone -value <>