

### High Speed Cores Design convergence on Intel's 20A Process

Kamal Deep Rajput, Parbhat Kangra Intel



### Design Spec and Sensitivity

### **DESIGN SPEC**

**Block Complexity** 

- General Specifications
  - Design : Block A
  - High Speed Design
  - Process Node: Intel 20A
  - Instance Count: 1.3 million+
  - Complex Memories: 200+ memories with transparencies inside
  - Clock Tree Synthesis (CTS): Multipoint CTS with multiple tap points
- FC Flow Runtime:
  - Total APR runtime from import design till route\_opt : ~127 hrs (5+ days)



## DESIGN SENSITIVITY

### **Block Complexity**

- The table below shows how sensitive the design is to clock cycle time (CT).
- A ~6.6% change in clock frequency shifted TNS by ~92% for reg2reg paths, ~83% for in2reg paths, and ~98% for the reg2out bus that impacts other partitions.
- Disclaimer: the numbers below are normalized and given for reference only.
  - P<sup>int</sup> = 20000
  - T<sup>int</sup> = 400 ns
  - P<sup>ext</sup> = 10000
  - T<sup>ext</sup> = 300 ns
  - Crit\_Reg\_Out = 150 ns
  - Crit\_A2B = 200 ns

| Corner | #paths (int) | TNS (int) | #paths (ext) | TNS (ext) | Crit_reg2out TNS | Path from Block 'A' impacting Block 'B' (Interface) TNS |
|--------|--------------|-----------|--------------|-----------|------------------|---------------------------------------------------------|
| max_high, CT | 20000 (P<sup>int</sup>) | 400 ns (T<sup>int</sup>) | 10000 (P<sup>ext</sup>) | 300 ns (T<sup>ext</sup>) | 150 ns (Crit_Reg_Out) | 200 ns (Crit_A2B) |
| max_high, 1.06CT | 2k (10% P<sup>int</sup>) | 36 ns (8% T<sup>int</sup>) | 2200 (22% P<sup>ext</sup>) | 54 ns (17% T<sup>ext</sup>) | 16.5 ns (11% Crit_Reg_Out) | 4 ns (2% Crit_A2B) |





### Clock Design Strategy: Latency, Power, and Timing



Clock tuning experiments to improve latency, timing, and power:

- The following experiments were run to find the clock settings needed to improve the design:

| Run Name | Cell type | dg allowed | CTS primary corner | tran limit | max repeater | CTS max fan-out | metal layer allowed | Set Min_cap at | | Med Lat | Max Lat (max_high %CT) | Min Lat (max_high %CT) |
|----------|-----------|------------|--------------------|------------|--------------|-----------------|---------------------|----------------|---|---------|------------------------|------------------------|
| RUN1     | all buf/inv                   | all dg used                       | max_nom               | 0.75(Y/Z)  |                 | no                 | Upto<br>Top-2 layer         |                | 47 | 57      | 80                        | 32                        |
| RUN2     | Selected inv/buf<br>templates | mid-range dg                      | max_high              | Y/Z        | 4               | no                 | Upto<br>Top-2 layer         |                | 41 | 56      | 72                        | 31                        |
| RUN3     | Selected inv/buf<br>templates | mid-range dg                      | max_high              | Y/Z        | 4               | no                 | Opened Top 2<br>layers also |                | 51 | 53      | 77                        | 25                        |
| RUN4     | Selected inv/buf<br>templates | mid-range dg                      | max_high              | Y/Z        | 4               | no                 | Opened Top 2<br>layers also | yes            | 50 | 54      | 76                        | 27                        |
| RUN5     | Selected inv/buf<br>templates | Mid-range @ CTS,<br>low+mid @ CRO | max_high              | 1.1(Y/Z)   |                 | 42                 | Opened Top 2<br>layers also |                | 45 | 57      | 77                        | 31                        |
| RUN6     | Selected inv/buf<br>templates | Mid-range @ CTS,<br>low+mid @ CRO | max_high              | Y/Z        | 4               | no                 | Opened Top 2<br>layers also |                | 48 | 56      | 76                        | 28                        |
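As a quick cross-check of the table above, the following Python sketch ranks the six runs by median latency. The latency values are copied from the Med Lat, Max Lat, and Min Lat columns (in %CT). The dictionary layout, the function name, and the choice of RUN1 as the reference are illustrative assumptions and are not part of the original flow.

```python
# Latency figures (in % of cycle time) copied from the experiment table above.
# The tuple layout (median, max, min) is an assumption made for this sketch.
LATENCY_PCT_CT = {
    "RUN1": (57, 80, 32),
    "RUN2": (56, 72, 31),
    "RUN3": (53, 77, 25),
    "RUN4": (54, 76, 27),
    "RUN5": (57, 77, 31),
    "RUN6": (56, 76, 28),
}


def rank_by_median(latencies):
    """Return run names sorted by ascending median latency (ties keep table order)."""
    return sorted(latencies, key=lambda run: latencies[run][0])


if __name__ == "__main__":
    reference = "RUN1"  # assumed reference run
    for run in rank_by_median(LATENCY_PCT_CT):
        median, max_lat, min_lat = LATENCY_PCT_CT[run]
        delta = median - LATENCY_PCT_CT[reference][0]
        print(f"{run}: median={median}%CT ({delta:+d} vs {reference}), "
              f"max-min spread={max_lat - min_lat}%CT")
```

With these numbers RUN3 comes out first, which matches the statement that RUN3 has the better median latency.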

- Normalized margin graph (x-axis: slack):
  - The **RUN3** distribution (in yellow) is shifted to the right of the reference, which indicates better slack values.



- Clock skew graph (x-axis: clock skew):
  - The **RUN3** distribution (in yellow) is shifted to the right of the reference: more paths see positive clock skew, which helped improve timing QoR.



- Total delay graph:
  - The **RUN3** distribution (in yellow) is shifted to the right of the reference: datapath delay increased, which helped reduce the power of block 'A' without impacting timing QoR.





- Timing, power, and clock cell-count results of the experiments (timing and power normalized to RUN1):

| Run Name | PrimeTime: Total setup (max_high) | PrimeTime: R2R setup (max_high) | PrimeTime: In2reg setup (max_high) | PrimeTime: R2R hold (max_nom) | PTPX: Total pwr | PTPX: Clk pwr |
|----------|-----------------------------------|---------------------------------|------------------------------------|-------------------------------|-----------------|---------------|
| RUN1     | 1                                 | 1                               | 1                                  | 1                             | 1               | 1             |
| RUN2     | 0.8                               | 0.8                             | 1.2                                | 1.4                           | 0.98            | 0.95          |
| RUN3     | 1.0                               | 0.3                             | 2.6                                | 1.4                           | 0.97            | 0.92          |
| RUN4     | 1.3                               | 0.7                             | 3.0                                | 1.4                           | 0.96            | 0.92          |
| RUN5     | 0.9                               | 0.7                             | 1.4                                | 0.2                           | 0.96            | 0.87          |
| RUN6     | 1.0                               | 0.7                             | 1.4                                | 1.5                           | 0.97            | 0.90          |

| Database | Cell count            | Total                  |
|----------|-----------------------|------------------------|
| Ref      | No. of clock cells    | N<sup>total</sup>      |
| Ref      | No. of clock_inverter | N<sup>inv</sup>        |
| Ref      | No. of clock_buffer   | N<sup>buf</sup>        |
| Test     | No. of clock cells    | 0.9N<sup>total</sup>   |
| Test     | No. of clock_inverter | 0.8N<sup>inv</sup>     |
| Test     | No. of clock_buffer   | 0.86N<sup>buf</sup>    |

- RUN3 QoR:
  - Enabling the top 2 layers for the clock NDR increased clock-net metal usage on those layers from ~1% to ~5%.
  - Better median latency than the reference.
  - The median latency improvement came at the cost of in2reg setup paths; this trade-off was handled with targeted clock pushes in the ECO phase.
  - ~3% total power gain and 8% clock power gain.
  - The clock tree is dominated by inverter cells.
- The RUN5 recipe had better TNS and power, but its excessive relaxation of the max-cap and transition limits resulted in sign-off logical DRC violations.







CTS settings to improve timing and meet target latency:

- **CTS**:
  - Clock latencies were fine-tuned according to path criticality and validated during CTS, to minimize iteration loops and manual effort.
  - Challenges in clock tuning (problem statement):
    - Latency on some critical in2reg paths was on the low side, while latency on reg2out paths was on the high side.
    - Median latency was high and increased significantly from the CTS stage to CRO.
    - The balance-point offset values introduced by CCD at compile required many iterations to settle on the required latency for critical paths.

### Clock tuning settings used :

- To limit the latency jump from CTS to clock\_route\_opt (CRO), the following settings were used in CRO:
  - `set_app_options -name clock_opt.flow.enable_ccd_clock_drc_fixing -value always_off`
  - `set_app_options -name ccd.max_postpone -value $limit`
  - `set_app_options -name ccd.max_prepone -value $limit`
  - `set_app_options -name opt.common.hold_fixing_setup_margin -value $value`

- **CTS**:
  - **Clock tuning settings used:** steps followed to meet the latency requirement for the critical register families:
    - The settings below are common to all clock-tuning scenarios.
    - Disable CCD optimization on the critical register families from the compile stage onward, which also results in zero offset values:
      - `set_app_options -name ccd.respect_cts_fixed_balance_pins -value true`
      - `set_attribute [get_flat_pins $target_register/clk*] -name cts_fixed_balance_pin -value true`
    - For clock pushing:
      - `set_clock_balance_points -clock [get_clocks $clk_name] -delay -$push -balance_points $target_clk_pin -scenarios $scenario`
    - For clock pulling, create skew groups for the different families of sequentials:
      - `create_clock_skew_group -name $skew_gp_name -objects $target_clock_pin -clock $clk_name`
      - `set_clock_balance_points -clock [get_clocks $clk_name] -delay $pull -balance_points $target_clk_pin -scenarios $scenarios`

Latencies were fine-tuned relative to the median latency of the design.

With a skew group in place, the latencies of its registers depend purely on their placement relative to the tap drivers, not on any register outside that skew group. Deciding the amount of push or pull for each skew group was an iterative process.

- **CTS**:
  - Clock latency:
    - FC requirement:
      - a) Critical in2reg family: ~65%CT
      - b) Critical reg2out family: ~47%CT
  - Latencies achieved through construction:

| Median latency | Crit in2reg family | Latency |
|----------------|--------------------|---------|
| 55%CT          | Group 1            | 67%CT   |
|                | Group 2            | 72%CT   |
|                | Group 3            | 69%CT   |
|                | Group 4            | 73%CT   |
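To make the relationship between the median latency, the FC requirement, and the achieved group latencies concrete, here is a small Python sketch of the arithmetic. All numbers come from the slide above (in %CT). The function names are hypothetical, and treating the ~65%CT in2reg requirement as a minimum latency to reach is an assumption of this sketch.

```python
# Values (in % of cycle time, CT) are taken from the slide above.
MEDIAN_LATENCY = 55       # median latency achieved through construction
IN2REG_REQUIREMENT = 65   # approximate FC requirement, critical in2reg family
ACHIEVED = {"Group 1": 67, "Group 2": 72, "Group 3": 69, "Group 4": 73}


def offset_from_median(latency, median=MEDIAN_LATENCY):
    """Positive result = the group's clock arrives later than the median (pushed)."""
    return latency - median


def margin_over_requirement(latency, requirement=IN2REG_REQUIREMENT):
    """Non-negative result = requirement met (assumes the requirement is a minimum)."""
    return latency - requirement


if __name__ == "__main__":
    for group, latency in ACHIEVED.items():
        print(f"{group}: offset from median={offset_from_median(latency):+d}%CT, "
              f"margin over requirement={margin_over_requirement(latency):+d}%CT")
```

Under these assumptions every in2reg group sits 12 to 18 %CT above the median and at least 2 %CT above the ~65%CT requirement.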





# **Placement & Routing Challenges**

### Placement & Routing Challenge

- Timing paths pass through 1k+ latches in transparency (shown in red) and through memories (shown in yellow) before ending at interface ports.



- Challenges:
  - Jogs and lower-metal-layer usage on longer nets
  - Fewest possible repeater levels (1-2)
  - Critical sequential alignment
    - Latch placement was not aligned with the memory input pins
  - Crosstalk on otherwise good, straight routes because no NDR was applied



- **Target**: quality right from construction
  - Bounding logic: bounded 1k+ sequentials in the vertical channels based on the location of their fan-in cone
  - Decision on repeater count
  - Optimized placement of sequentials and repeaters, aligned with the memory input pins
  - NDR to address crosstalk, congestion, and timing at the same time
    - a. Routing guide with a 50% partial percentage in that region
    - b. Routing blockages on alternate tracks
  - This strategy achieved the targeted results up to a point; the remainder was handled with custom pre-routes

### Custom Pre-route Strategy

- Drew straight routes, using the memory input pins as reference
- Applied an NDR and one repeater on the route up to the sequential
- Used different horizontal metal layers depending on location, to avoid congestion on any single layer
- Used a large buffer to take full advantage of the NDR on higher metal layers; splitting it into an inverter pair increased via resistance
- Decided the optimal placement of each sequential (in red) and its repeater (in blue)
- Added via strapping on the repeater to reduce RC as much as possible

| Distance   | No. of buffers | Metal layer/NDR   |
|------------|----------------|-------------------|
| '6.25A' um | 1              | Top Layer/NDR     |
| '4.6A' um  | 1              | Top Layer - 2/NDR |
| '2.7A' um  | 1              | Top Layer - 4/NDR |
| 'A' um     | 0              | Top Layer - 6/NDR |

- Achieved ~37% RC improvement over the reference DB





# Time Borrow, Pre-buffering, & Slack Based NDR

### Time Borrow, Pre-Buffering, & Slack Based NDR



- Time Borrow:
  - Setting a time-borrow limit on transparency paths through critical sequentials gave a ~5%CT (average) gain in datapath delay, through better optimization and fewer logic levels
- Slack-Based NDR:
  - ~4%CT (average) gain in RC delay

| FC slack M (ps)              | vr length L (um)          | Metal layer / priority          | Pre-buffering |
|------------------------------|---------------------------|---------------------------------|---------------|
| M < worst slack              | L > Max-limit             | Top Layer / 1<sup>st</sup>      | Yes           |
| M < worst slack              | Max-limit > L > Mid-limit | Top Layer - 2 / 2<sup>nd</sup>  | No            |
| Mid slack > M > worst slack  | L > Max-limit             | Top Layer / 3<sup>rd</sup>      | Yes           |
| Mid slack > M > worst slack  | Max-limit > L > Mid-limit | Top Layer - 2 / 4<sup>th</sup>  | No            |
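The decision table above can be read as a two-level lookup on slack and net length. The Python sketch below is one possible encoding of it, not the authors' script: the function and type names are hypothetical, the threshold arguments stand in for the undisclosed "worst slack", "mid slack", "Max-limit", and "Mid-limit" values, and the handling of values that fall exactly on a threshold is an assumption.

```python
from typing import NamedTuple, Optional


class NdrChoice(NamedTuple):
    layer: str        # metal layer label as used in the table
    priority: int     # 1 = highest priority
    pre_buffer: bool  # whether pre-buffering is applied


def classify_net(slack_ps, length_um, worst_slack, mid_slack,
                 max_limit, mid_limit) -> Optional[NdrChoice]:
    """Map (slack M, length L) to a row of the table; None if no row applies.

    Assumption: a value exactly equal to a threshold falls into the less
    critical band, since the table only shows strict inequalities.
    """
    if slack_ps < worst_slack:
        base_priority = 1          # rows 1 and 2
    elif slack_ps < mid_slack:
        base_priority = 3          # rows 3 and 4
    else:
        return None                # slack better than the mid threshold
    if length_um > max_limit:
        return NdrChoice("Top Layer", base_priority, True)
    if length_um > mid_limit:
        return NdrChoice("Top Layer - 2", base_priority + 1, False)
    return None                    # net too short for a slack-based NDR


if __name__ == "__main__":
    # Hypothetical thresholds, chosen only to exercise the four rows.
    limits = dict(worst_slack=-50, mid_slack=-20, max_limit=200, mid_limit=100)
    for slack, length in [(-60, 250), (-60, 150), (-30, 250), (-30, 150)]:
        print(slack, length, classify_net(slack, length, **limits))
```

Encoding the priority as an integer keeps the ordering explicit, so nets can be sorted before routing resources on the top layers are assigned.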



# Initial vs Final Timing QoR

### Initial vs Final Timing QoR


- Crit\_reg2out TNS improvement: ~96%
- Full Chip TNS improvement: ~93%
- The trade-off between median latency improvement and in2reg setup paths was handled with targeted clock pushes in the ECO phase
- FC timing comparison of the initial vs. final APR DB of block 'A', done with the same constraints

Where T<sup>ext</sup> << T<sup>int</sup>

| DB Name      | Corner        | External TNS         | Internal TNS         |
|--------------|---------------|----------------------|----------------------|
| Initial DB   | max_high      | T<sup>ext</sup>      | T<sup>int</sup>      |
| Final APR DB | max_high      | 2.0T<sup>ext</sup>   | 0.02T<sup>int</sup>  |
| Initial DB   | max_nom       | T<sup>ext</sup>      | T<sup>int</sup>      |
| Final APR DB | max_nom       | 2.0T<sup>ext</sup>   | 0.01T<sup>int</sup>  |
| Initial DB   | min_pfff_high | T<sup>ext</sup>      | T<sup>int</sup>      |
| Final APR DB | min_pfff_high | 0.04T<sup>ext</sup>  | 0.6T<sup>int</sup>   |
| Initial DB   | min_low_cold  | T<sup>ext</sup>      | T<sup>int</sup>      |
| Final APR DB | min_low_cold  | 0.09T<sup>ext</sup>  | 1.6T<sup>int</sup>   |

- Full Chip TNS comparison:

| DB (max_high corner) | ext_tns              | int_tns              | crit_reg2out_tns              | crit_in2reg_tns              | FC TNS              |
|----------------------|----------------------|----------------------|-------------------------------|------------------------------|---------------------|
| Initial DB           | T<sup>ext</sup>      | T<sup>int</sup>      | T<sup>crit_reg2out</sup>      | T<sup>crit_in2reg</sup>      | T<sup>FC</sup>      |
| Final APR DB         | 0.09T<sup>ext</sup>  | 0.04T<sup>int</sup>  | 0.04T<sup>crit_reg2out</sup>  | 1.25T<sup>crit_in2reg</sup>  | 0.07T<sup>FC</sup>  |
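The improvement percentages quoted on this slide follow directly from the normalized ratios in the table. The Python sketch below shows the arithmetic; the ratios are copied from the Final APR DB row, while the dictionary keys and the function name are illustrative assumptions.

```python
# Final-DB TNS as a fraction of the initial-DB TNS, copied from the
# Final APR DB row of the table above (max_high corner).
FINAL_OVER_INITIAL = {
    "ext_tns": 0.09,
    "int_tns": 0.04,
    "crit_reg2out_tns": 0.04,
    "crit_in2reg_tns": 1.25,
    "fc_tns": 0.07,
}


def improvement_pct(ratio):
    """Percentage reduction in TNS; a negative value means the TNS got worse."""
    return round((1.0 - ratio) * 100.0, 1)


if __name__ == "__main__":
    for name, ratio in FINAL_OVER_INITIAL.items():
        print(f"{name}: {improvement_pct(ratio):+.1f}%")
```

This reproduces the ~96% Crit\_reg2out and ~93% Full Chip improvements, and shows the crit\_in2reg TNS degrading by 25%, which is the in2reg trade-off mentioned in the bullets.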


