

## Alleviate Crosstalk and Congestion Challenges with Automated Full Crossbar and Memory Channel Implementation

Naman Gupta Harsha K. Ranjan Yash Gehlot Sr. Staff Engineer Staff Engineer SiFive SiFive

Vaibhav Gupta Warren Lew Sr. Staff FAE Director SiFive SiFive

**Senior Engineer** SiFive

Shingo IBA **R&D** Engineering, Sr. Staff Engineer Synopsys

### Synopsys + SiFive – Broad and Deep Partnership



• Co-optimize Tools, Flows, Methodologies, and Cores

- Synopsys access to SiFive IP, including cores under development
- Executive sponsorship and synchronization





Agenda







## SiFive Intro and Processor IP Portfolio



## SiFive is the leader of the RISC-V era

- SiFive is first-to-market with RISC-V features, standards, and technologies
- Broadest RISC-V portfolio enables highperformance computing for both AI (NPU) and general purpose CPUs
- Full system-level solutions with security, coherent interconnect, IOMMU and D2D
- SiFive solutions are silicon proven with more than 350 design wins in consumer, infra. and auto.
- Extensive software offering & partner ecosystem







# design wins

Multiple Top 10 Semi manufacturers Multiple Top 5 Robotaxi manufacturers Multiple Top 5 US Datacenter & Storage suppliers Multiple Top Datacenter vendors in Asia Publicly Listed Chinese Car OEM Multiple A&D Prime Contractors

## SiFive broad IP portfolio

Scalable from MCU to high-performance compute



|                                                                 | Intelligence                                          |                                                         |                                                                                         |                                                                                                                                     | Performance                                                                          |                                                                                                                                            |                                                                                                                                                                    |
|-----------------------------------------------------------------|-------------------------------------------------------|---------------------------------------------------------|-----------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <b>X200-Series</b><br>512-bit VLEN<br>Single Vector ALU<br>VCIX | Up to                                                 | <b>0-Series</b><br>1024-bit VLEN<br>Ə / Dual Vector ALU | <b>P200-Series</b><br>2-wide in-order core<br>256b vector length<br>WorldGuard<br>RVA20 | P400-Series<br>3-wide OoO core<br>128b vector length<br>Hypervisor extension<br>Vector crypto<br>IOMMU & AIA<br>WorldGuard<br>RVA22 | <b>P500-Series</b><br>3-wide OoO core<br>Hypervisor extension<br>WorldGuard<br>RVA20 | <b>P600-Series</b><br>4-wide OoO core<br>128b vector length<br>Hypervisor extension<br>Vector crypto<br>IOMMU & AIA<br>WorldGuard<br>RVA22 | <b>P800-Series</b><br>6-wide OoO core<br>128b vector length<br>Hypervisor extension<br>Vector crypto<br>IOMMU & AIA<br>WorldGuard<br>Shared cluster cache<br>RVA23 |
|                                                                 |                                                       |                                                         |                                                                                         |                                                                                                                                     |                                                                                      |                                                                                                                                            |                                                                                                                                                                    |
|                                                                 | Essential                                             |                                                         |                                                                                         |                                                                                                                                     | Automotive                                                                           |                                                                                                                                            |                                                                                                                                                                    |
|                                                                 | Essential<br>U6-Series<br>64-bit, high<br>performance | <b>U7-Series</b><br>64-bit, superscalar<br>performance  | <b>E6-A</b><br>32-bit, balanced<br>performance and<br>efficiency ASIL B, D              | <b>S7-A</b><br>64-bit, high<br>performance<br>embedded A                                                                            | <b>X280</b><br>512-bi<br>Single                                                      |                                                                                                                                            | <b>P870-A</b><br>6-wide OoO core<br>128b vector length<br>Hypervisor extension                                                                                     |
| <b>S2-Series</b><br>64-bit, Area<br>optimized                   | <b>U6-Series</b><br>64-bit, high                      | 64-bit, superscalar                                     | 32-bit, balanced                                                                        | 64-bit, high                                                                                                                        | <b>X280</b><br>512-bi<br>Single                                                      | t VLEN                                                                                                                                     | 6-wide OoO core                                                                                                                                                    |

## **SiFive Consumer platform**



- Heterogeneous architecture
- Superior performance efficiency
- System-level solution
- Optimized for Android







## Implementation Challenges for MC16

# Design Details

| items             | Value                                       |
|-------------------|---------------------------------------------|
| RTL Configuration | MC16 / SiFive_CoreIPSubsystem               |
| Frequency goal    | 1.xx GHz                                    |
| Design Details    | 3~4M inst. 16xP670 cores and 32 MB L3 Cache |
| FC version        | T-SP5 20230424                              |





#### **Challenges**

- Large crosstalk in post route.
- Register placement control for pipeline structure.
- Minimize user implementation effort and design iterations.

# Initial Results: Large Crosstalk in Post-Route synopsys

Initial Results:

- Large cross talk overserved in highest layers
- Preroute stage layer assignment is not causing heavy congestion. However, high route density cause large xtalk in post route even with track SI opt.
- SiFive used dummy PG for preventing excessive layer promotion in early phase

#### route\_opt net with crosstalk > 5ps is highlighted



#### Track availability

|                     | Available Resources | route_opt Results |
|---------------------|---------------------|-------------------|
| Middle Layer        | 6K                  | 1.6K              |
|                     | 6K                  | 2.1K              |
| Intermediate Layers | 6K                  | 2.1K              |
|                     | 6K                  | 1.2K              |
| High Layer          | 3.2K tracks         | 2.3K tracks       |

## **Repeater Spreading**



Repeater spreading give spacing between repeater paths in channel region.



#### Results with repeater spreading tuning



## Enable Wire-opt Improvement (EWI)



- Wire delay dominates over cell delay in advanced nodes.
- Wire-opt utilizes opportunity to reduce wire delay on critical nets through effective usage of ultra-fast low-resistance upper layers and NDRs.
- During timing optimization, critical nets are selectively chosen for layer promotion & NDR application together with re-buffering.
  - Natively integrated into core optimization engines with accurate costing of timing & congestion impact
- Tool evaluates multiple combinations of layer promotion & different types of NDRs with or without rebuffering.
  - NDR can be spacing (3s) or width + spacing (2w2.5s) both possibilities are considered (latter needs extra app option)
- Global & detailed routing engines honor the guidance from pre-route while selectively ignoring un-realizable constraints to ensure best congestion & DRCs.

## Enable Wire-opt Improvement (EWI)





# MC16 Compile Result with Spread Repeaters siFive synopsys (cell placement)

Baseline

Test-run with spread repeaters Path is spreading



## Cross Talk Net in route\_auto (track SI opt)

#### Baseline

• Net with cross talk over 50ps is highlighted

#### Q 🔯 ∰ Detailed Data 📄 Runs 📄 Stages 🛄 Info 🎄 Settings 🕅 Legend >



Test-

run



Cross talk impact is reduced

## MC16 Results with Spread Repeater



#### Spread repeater options improved results in post route stage

Baseline :

Tested setting :

track utilization 50%

EWI

spread repeater settings

| Stage                                   | Setup                            | Setup         |         |         | Hold  |      |  |
|-----------------------------------------|----------------------------------|---------------|---------|---------|-------|------|--|
|                                         | r2rWNS                           | r2rTNS        | r2rNVE  | HWNS    | HTNS  | HNVE |  |
| <ul> <li>Run: run11-0_</li> </ul>       | baseline_n                       | oOPT_withBour | nds (BA | SELINE) |       |      |  |
| initial_opto                            | -3.01                            | -177.5        | 5461    | -0.060  | -9.3  | 924  |  |
| compile                                 | -2.82                            | -73.3         | 1652    | -0.060  | -6.4  | 685  |  |
| clock_opt_opt                           | -0.29                            | -479.6        | 22485   | -0.010  | -0.2  | 108  |  |
| route_auto                              | -0.31                            | -758.4        | 24015   | -0.050  | -9.3  | 2227 |  |
| route_opt                               | -0.20                            | -60.9         | 1878    | -0.070  | -0.6  | 56   |  |
| <ul> <li>Run: run11-1_</li> </ul>       | Run: run11-1_EWI_TU50_withBounds |               |         |         |       |      |  |
| initial_opto                            | -3.16                            | -368.3        | 11783   | -0.060  | -14.1 | 1453 |  |
| compile                                 | -2.59                            | -74.2         | 1657    | -0.060  | -10.1 | 993  |  |
| clock_opt_opt                           | -0.22                            | -175.2        | 14904   | -0.010  | -0.2  | 134  |  |
| route_auto                              | -0.23                            | -209.1        | 9612    | -0.040  | -9.6  | 2321 |  |
| route_opt                               | -0.15                            | -21.2         | 604     | -0.050  | -0.2  | 33   |  |
| Run: run11-2_EWI_TU50_spread_withBounds |                                  |               |         |         |       |      |  |
| initial_opto                            | -2.88                            | -311.4        | 11458   | -0.080  | -14.7 | 1454 |  |
| compile                                 | -3.31                            | -72.8         | 1445    | -0.070  | -9.8  | 1176 |  |
| clock_opt_opt                           | -0.17                            | -40.9         | 4899    | -0.020  | -0.4  | 152  |  |
| route_auto                              | -0.16                            | -32.1         | 2764    | -0.020  | -6.9  | 2268 |  |
| route_opt                               | -0.11                            | -7.1          | 246     | -0.010  | -0.1  | 24   |  |

## Register Placement Control for Pipeline Structure For Deterministic Placement



- There are many "pipeline" structures.
- In the original runs, 63 bounds were used for pipeline register location control and deterministic register placement.
- Reduction of bounds is desired



## Register Balancing in MC16



19

- Balance registers worked on 253K regs (27% of registers in design)
- Initial\_place results looks reasonable
- Balance registers also could work on non-pure shift register path where amount of comb. logic is small



## **Hierarchy Splitting**

- In absence of design bounds, many hierarchies were being split physically.
- This resulted in timing violations due to net detours.



#### Expected patterns





## **Placement Attraction**



We requested for an update to the RTL to align logical structure and physical structure. Placement attraction can bundle the paths without coordinates.



#### **Placement results**



## Minimizing Bounds and Hierarchy Splitting



#### CL3 to EC1/EC2/EC3 No bounds

•

Guide with bounds



# CL3 EC1 61 A





#### Reduced bounds script for whole MC16 block

| hierarchy1 {{2814.<br>hierarchy2 {{1789.<br>hierarchy3 {{2791.<br>hierarchy4 {{1789.<br>Goreach {hiername | 8930 850.0800}<br>8455 5011.69}<br>8930 5011.69} | {2814.0975<br>{3790.2095 | 1129.9660}} \<br>1129.9660}} \<br>5262.528}} \<br>5262.528}} \      |
|-----------------------------------------------------------------------------------------------------------|--------------------------------------------------|--------------------------|---------------------------------------------------------------------|
| hierarchy4 {{1789.                                                                                        | .8930 5011.69}                                   | •                        |                                                                     |
| Foreach {hiername                                                                                         |                                                  | {2791.8455               | 5262.528}                                                           |
| •                                                                                                         | hhav) thiannama                                  |                          |                                                                     |
| set bounds_cel                                                                                            | · ·                                              |                          | "full_name=~\$hiername/*"]                                          |
| if [sizeof \$bo                                                                                           | ounds_cells] {                                   |                          |                                                                     |
|                                                                                                           | 0                                                |                          | name [sizeof \$bounds_cells]"<br>nd -boundary \$bbox \$bounds_cells |
| }                                                                                                         |                                                  |                          |                                                                     |

coordinate hard coding for only 4 bounds.



snuc

**Si**Five

## Data Flow from CL3 to EC3







## Placement Setting for MC16



#### initial\_place

- Placement attraction to avoid path split
- Balance registers
- (optional) Bounds



#### initial opto

- Spread repeater to avoid crosstalk
- Control upper layer track utilization for cross talk reduction (EWI)









## Summary

With the help of Synopsys, we were able to identify scalable solutions to the following problems:

- Crosstalk issues.
- Excessive usage of higher metal layers.
- Unbalanced pipeline stages.
- Lack of determinism across runs.
- Lack of a scalable recipe that works on any floorplan.

Solution Space that we tried and discussed during this presentation:

- Enable Wire-opt Improvement
- Register Balancing
- Repeater Spreading
- Placement Attractions

QIK flow for MC16 is available for download from SolvNet that contains the major recipe that designers can leverage.









## SELECT THE **BEST SOLUTION**...

THEN SIT BACK AND LET THE TOOL DO THE WORK... ...ACCELERATING YOUR PATH TO **BEST PPA** 

**Si**Five







# THANK YOU

YOUR INNOVATION YOUR COMMUNITY