

### Performance Improvements at Full-Chip level using Elastic on Intel XEON designs

Matt Nichelson, SoC Design Engineer, Intel

## Agenda



- Problem statement "keeping verification up with Moore's Law"
- Overview of ICV compute options
- A new way of managing FC DRC jobs at Intel
- Sierraforest DRC core/memory usage comparison
- Impact to compute costs using Elastic
- Future Enhancements
- Summary
- Q&A

#### Ref #2 SILICON VALLEY 2024 3

#### Problem Statement - "keeping up with Moore's Law"

- Increasing CPU complexity puts high demand on FC DRC
- Transistor counts on silicon are increasing and designs are more complex; innovation is needed to reduce FC DRC runtimes and compute costs
- A dedicated hardware pool for FC DRC/tape-in is expensive; high-capacity jobs need to be incorporated seamlessly into compute farms/cloud
- The Sierraforest XEON product on Intel 3 technology needed a way to simplify DRC signoff complexity under a new internal compute cost structure





## Overview of ICV compute options



- Single host
  - Small load on compute farm
  - Very long runtimes (unrealistic)
  - Starts quickly
- Multi-host
  - Large load on compute farm
  - Fast runtime
  - Long delays in starting
- Elastic CPU
  - Optimizes resources (saves \$\$)
  - Good runtime
  - Starts quickly
  - Dynamically adds/removes hosts



# A new way of managing Full-Chip DRC jobs



- Historically, Full-Chip DRC could not run the flat DRC deck because of extremely long runtimes (multiple days).
  - The solution was to split the DRC deck into multiple flows based on individual layers (~50 flows).
- Each layer/flow was executed on a single machine.
  - Each flow had unique memory/core count requirements (high overhead and wasted resources).

| Main Bundle | Split Flow | Host Cores | Host Mem (MB) | Avg Mem (MB) | Peak Mem (MB) | Avg/Host Mem Usage | Peak/Host Mem Usage | Runtime     |
|-------------|------------|------------|---------------|--------------|---------------|--------------------|---------------------|-------------|
| drcd        | drc_NW     | 48         | 1,583,625     | 155,273      | 1,044,111     | 9.8%               | 65.9%               | 18h:51m:56s |
| drcd        | drc_DF     | 24         | 790,908       | 77,286       | 528,821       | 9.8%               | 66.9%               | 11h:16m:12s |
| drcd        | drc_PG     | 48         | 1,056,160     | 15,755       | 511,584       | 1.5%               | 48.4%               | 2h:59m:04s  |
| drcd        | drc_PL     | 48         | 1,583,625     | 83,507       | 699,801       | 5.3%               | 44.2%               | 14h:14m:30s |
| drcd        | drc_M1     | 16         | 790,911       | 45,425       | 525,506       | 5.7%               | 66.4%               | 21h:05m:45s |
| drcd        | drc_M2     | 16         | 790,911       | 49,383       | 283,887       | 6.2%               | 35.9%               | 10h:32m:47s |
| drcd        | drc_M3     | 48         | 2,113,123     | 59,048       | 526,190       | 2.8%               | 24.9%               | 11h:46m:38s |
| drcd        | drc_M4     | 16         | 790,911       | 26,021       | 529,851       | 3.3%               | 67.0%               | 6h:08m:56s  |
| drcd        | drc_M5     | 24         | 1,056,170     | 33,419       | 522,032       | 3.2%               | 49.4%               | 5h:08m:15s  |

- With the introduction of ICV Elastic, Intel transitioned from *split flows* to running *flat DRC* and saw <u>significant</u> improvements by allowing the ICV engine to dynamically distribute the full DRC deck across multiple machines. Elastic uses the ICV Validator NXT feature tool license.
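
The utilization columns in the split-flow table follow directly from its memory figures. A minimal Python sketch, using three rows copied from the table (the function and variable names are illustrative, not from the deck), shows how little of each host the split flows used on average:

```python
# Rows copied from the split-flow table: (flow, host_mem, avg_mem, peak_mem).
# All three memory figures share the table's unit, so the ratios are unit-free.
SPLIT_FLOWS = [
    ("drc_NW", 1_583_625, 155_273, 1_044_111),
    ("drc_PG", 1_056_160, 15_755, 511_584),
    ("drc_M3", 2_113_123, 59_048, 526_190),
]

def utilization(host_mem, used_mem):
    """Return used memory as a percentage of host memory, rounded to 0.1."""
    return round(100.0 * used_mem / host_mem, 1)

for flow, host, avg, peak in SPLIT_FLOWS:
    print(f"{flow}: avg {utilization(host, avg)}%, peak {utilization(host, peak)}%")
```

The average figures stay below 10% of host memory, which is the wasted headroom that motivated the move away from per-flow machines.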

#### A new way of managing Full-Chip DRC jobs

- Legacy method of running FC DRC: the full DRC deck goes through a manual layer "split" and "group" based on expected compute load, giving ~50 split flows (e.g. drc\_poly, drc\_diff, drc\_m0, drc\_m1), each run on its own machine (Machine 1, Machine 2, Machine 3, Machine n). Need to manage unique MEM/CORE requirements for each flow/machine!
- New method using ICV Elastic to manage required computing: the full DRC deck starts on the primary machine under the **ICV Elastic Scheduler**. Elastic determines how many "minion" machines to add (Minion +1) or remove (Minion -1) based on job load and dependency graph.

## "Elastic" computing adds and removes hosts

1. Verification job starts on the "primary" machine
   - On Sierraforest this machine was typically 1.5TB or 2TB for DRC/Antenna
2. ICV internal engine determines when to add/remove additional "minions"
   - DRC data shows average memory on minion hosts to be under 100GB, with average peak memory at 300GB
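
The add/remove behaviour can be illustrated with a toy model. The sketch below is not the ICV Elastic scheduler; it assumes a hypothetical dependency graph (`DEPS`), assumes every task takes one time step, and uses an invented sizing policy (`minions_needed`) purely to show how runnable load derived from a dependency graph could drive the host count up and back down:

```python
import math
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency graph: task -> set of tasks it must wait for.
DEPS = {
    "load": set(),
    "m1": {"load"}, "m2": {"load"}, "m3": {"load"}, "poly": {"load"},
    "via": {"m1", "m2"},
    "report": {"via", "m3", "poly"},
}

def ready_counts(deps):
    """Number of runnable tasks in each wave, assuming every task takes one step."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    counts = []
    while sorter.is_active():
        ready = sorter.get_ready()
        counts.append(len(ready))
        sorter.done(*ready)
    return counts

def minions_needed(ready_tasks, tasks_per_minion, max_minions):
    """Assumed policy: one minion per `tasks_per_minion` runnable tasks, capped."""
    return min(max_minions, math.ceil(ready_tasks / tasks_per_minion))

current = 0
for wave, ready in enumerate(ready_counts(DEPS)):
    target = minions_needed(ready, tasks_per_minion=2, max_minions=8)
    print(f"wave {wave}: {ready} ready, minions {current} -> {target}")
    current = target
```

In this toy graph the minion count rises when the independent layer checks become runnable and falls again once only the dependent steps remain.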



| Average Mem (GB) | Peak Mem (GB) |
|------------------|---------------|
| 282.381          | 1827.20       |
| 77.246           | 645.058       |
| 147.957          | 478.579       |
| 53.625           | 324.603       |
| 56.651           | 289.514       |
| 49.986           | 307.020       |
| 55.684           | 243.382       |
| 39.450           | 215.606       |
| 49.449           | 280.682       |
| 61.323           | 344.481       |




# Sierraforest DRC and core usage comparison



- Multi-host graph shows a fixed cost of **440** cores regardless of process load.
- Elastic ramps up to **440** cores as load increases and then releases cores as job starts to finish.
- Multihost consumed 5,947 core\*hour
- Elastic consumed 3,246 core\*hour
- 45% core cost savings with elastic
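
The savings figure is plain arithmetic on the two core\*hour totals. A short Python check, using only the numbers on this slide, is sketched below; the "implied runtime" line is an inference that assumes the multi-host job held all 440 cores for its entire run:

```python
def percent_savings(baseline, elastic):
    """Percentage of the baseline resource*hours that Elastic avoided."""
    return 100.0 * (baseline - elastic) / baseline

multihost_core_hours = 5_947
elastic_core_hours = 3_246
fixed_cores = 440

print(f"savings: {percent_savings(multihost_core_hours, elastic_core_hours):.1f}%")
# A fixed allocation burns cores * runtime, so core*hours / cores gives the
# wall-clock time implied by the multi-host figures (an assumption, see above).
print(f"implied multi-host runtime: {multihost_core_hours / fixed_cores:.1f} h")
```

This prints a saving of 45.4%, which rounds to the 45% quoted, and an implied multi-host runtime of about 13.5 hours.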



# Sierraforest DRC and memory usage comparison



- Multi-host graph shows a fixed memory cost of 6.6TB throughout the entire run.
- Elastic ramps up to **7TB** memory as load increases and then releases memory as job starts to finish.
- Multihost consumed 88TB memory\*hour
- Elastic consumed 56TB memory\*hour
- 36% memory savings with elastic
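
A memory\*hour total is the area under the usage curve. The sketch below shows one way to compute it from sampled data with the trapezoidal rule; the sample points are hypothetical and are not read off the plots in this deck:

```python
def resource_hours(samples):
    """Trapezoidal integral of (hour, usage) samples; result is usage*hours."""
    total = 0.0
    for (t0, u0), (t1, u1) in zip(samples, samples[1:]):
        total += (t1 - t0) * (u0 + u1) / 2.0
    return total

# Hypothetical samples: (elapsed hours, memory in TB). Not taken from the plots.
fixed = [(0, 6.6), (10, 6.6)]                             # flat multi-host style allocation
ramp = [(0, 1.5), (2, 4.0), (5, 7.0), (8, 3.0), (10, 1.5)]  # elastic style ramp up and down

print(f"fixed: {resource_hours(fixed):.1f} TB*hour")
print(f"ramp:  {resource_hours(ramp):.1f} TB*hour")
```

Even though the ramped curve peaks above the fixed allocation, its area is smaller because the peak is held only briefly, which is the same effect reported above for Elastic.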



## Sierraforest Antenna and core usage


- Multihost graph shows a fixed cost of **196** cores regardless of process load.
- Elastic ramps up to **210** cores as load increases and then releases cores as job starts to finish.
- Multihost consumed **3,141** core\*hour.
- Elastic consumed **2,421** core\*hour.
- 23% core cost savings with elastic
- Multihost runtime was **2.5** hours faster than elastic.



# Sierraforest Antenna and memory usage



- Multihost graph shows a fixed memory cost of **5.5TB** throughout the entire run.
- Elastic ramps up to **5.2TB** memory as load increases and then releases memory as job starts to finish.
- Multihost consumed **86TB** memory\*hour.
- Elastic consumed **61TB** memory\*hour.
- 29% memory savings with elastic.



#### Disk Space consumed by DRC Elastic run



    mnichels@scc920013 : du -ksh drc/
    1.4G drc/



#### Impact to compute costs using Elastic

- Using ICV Elastic has a direct impact on project cost
  - Higher memory usage and higher core count mean higher system requirements and a much higher cost per job
  - Savings from Elastic come from optimizing the required resources "on the fly"

#### **Machine Performance**

- **Fastest** per-CPU performance: single socket, fewer cores / scaling penalty
- **Slower** per-CPU performance: more cores / scaling penalty

Platform Performance

| Memory    | Fast | Faster | Fastest |
|-----------|------|--------|---------|
| 8GB       | 1x   | 1.3x   | 1.6x    |
| 16GB      | 1.3x | 1.7x   | 2.2x    |
| 32GB      | 1.7x | 2.2x   | 2.7x    |
| 64GB      | 2.7x | 3.5x   | 4.3x    |
| 128GB     | 6.7x | 8.7x   | 10.8x   |
| 256GB     | 8.3x | 10.9x  | 13.5x   |
| 512GB     | 16x  | 21x    | 26x     |
| 1TB       | 32x  | 41.8x  | 51.7x   |
| 1.5TB     | 100x |        |         |
| Up to 6TB | 200x |        |         |
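
To see why smaller minion hosts matter, the sketch below looks up the multiplier for the smallest memory tier that fits a given requirement. It assumes the multipliers in the first populated column of the table are relative cost per machine-hour, which the slide does not state explicitly, and it assumes 1TB = 1024GB when converting tier names; `tier_multiplier` is an illustrative helper, not part of any Intel or Synopsys tool:

```python
import bisect

# (memory tier in GB, normalized multiplier) from the first populated column of
# the table above. Treating the multiplier as relative cost is an assumption.
TIERS = [(8, 1.0), (16, 1.3), (32, 1.7), (64, 2.7), (128, 6.7),
         (256, 8.3), (512, 16.0), (1024, 32.0), (1536, 100.0), (6144, 200.0)]
TIER_SIZES = [size for size, _ in TIERS]

def tier_multiplier(required_gb):
    """Return the multiplier of the smallest tier that fits `required_gb`."""
    idx = bisect.bisect_left(TIER_SIZES, required_gb)
    if idx == len(TIERS):
        raise ValueError(f"{required_gb} GB exceeds the largest tier")
    return TIERS[idx][1]

# A minion peaking around 300GB versus a 1.5TB primary-class machine.
print(tier_multiplier(300))   # lands in the 512GB tier
print(tier_multiplier(1536))  # lands in the 1.5TB tier
```

Under these assumptions a host sized for a 300GB peak sits in the 16x tier while a 1.5TB host sits in the 100x tier, so every thread that can run on a minion rather than a primary-class machine is served by much cheaper hardware.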



| Job     | Mode       | Cost    |
|---------|------------|---------|
| DRC     | Multi-host | \$50.00 |
| DRC     | Elastic    | \$30.85 |
| DRC     | Savings    | 38.30%  |
| Antenna | Multi-host | \$35.00 |
| Antenna | Elastic    | \$28.07 |
| Antenna | Savings    | 19.80%  |

Normalized data

#### **Machine Availability**

- Higher quantities: bias toward smaller / quicker jobs
- Limited quantities: specialized workloads & critical path compute



### **Future Enhancements**



- Enable elastic to accept "tiered" minion hosts with varied memory size based on prediction of elastic job scheduler.
  - Compute "heavy" elastic threads (predicted) request larger minion memory class
  - Compute "light" elastic threads (predicted) request smaller minion memory class
  - This distribution of minion sizes will push some threads to smaller/cheaper/faster hardware.
- Use historical runtime info to help guide the choice of minion hardware for future job submissions.
  - Similar to request above but with a more solid prediction engine that records n-1, n-2 elastic jobs for better hardware forecast.
- Reduce host memory footprint for minion jobs to fall into less expensive and more available compute hardware in the Cloud.
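
One possible shape for the history-based prediction is sketched below. It is only an illustration of the idea: the memory classes, the 20% safety margin, the default class, and the rule "take the larger peak of the n-1 and n-2 runs" are all assumptions, not a description of any planned Elastic feature:

```python
MEMORY_CLASSES_GB = [64, 128, 256, 512, 1024]  # hypothetical minion classes

def predict_peak_gb(history, margin=1.2):
    """Predict the next peak from the last two runs (n-1, n-2) plus a margin."""
    recent = history[-2:]
    if not recent:
        return None  # no history yet
    return max(recent) * margin

def pick_class(history, default_gb=512):
    """Choose the smallest class that covers the predicted peak memory."""
    predicted = predict_peak_gb(history)
    if predicted is None:
        return default_gb
    for size in MEMORY_CLASSES_GB:
        if size >= predicted:
            return size
    return MEMORY_CLASSES_GB[-1]  # fall back to the largest class

# Peak memory (GB) recorded by earlier runs of the same thread, oldest first.
print(pick_class([240, 215]))  # "heavy" thread
print(pick_class([40, 45]))    # "light" thread
print(pick_class([]))          # first run, no history
```

With these example histories the heavy thread is routed to the 512GB class and the light thread to the 64GB class, which is the kind of split that would push light threads to smaller, cheaper hardware.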

# Summary



- Increased CPU complexity and transistor counts on the Intel Sierraforest project put high demand on Full-Chip DRC turnaround time.
- Intel transitioned from split flows to Elastic for large blocks and at Full-Chip level.
- Elastic enables significant improvements in reducing compute resources for both DRC and Antenna over Multi-host:
  - DRC reports 45% fewer cores\*hour and 36% less memory\*hour
  - Antenna reports 23% fewer cores\*hour and 29% less memory\*hour
- Cloud service costs are also reduced with Elastic by requiring less hardware for equivalent job throughput:
  - DRC reports 38% cost savings
  - Antenna reports 20% cost savings
- Continued advancements in memory optimization will help push Cloud costs down even further.



# THANK YOU


#### References


- Reference #1
  - "Intel Processor Transistor Count", Grant McFarland, Intel (PE)
- Reference #2
  - "SierraForest package photo", Tom's Hardware
  - https://www.tomshardware.com/news/intel-announces-288-core-processor-5th-gen-xeon-arrives-december-14
- Reference #3, #4, #5
  - "Sierraforest CORE/MEM data plots Elastic vs Multi-host", Jon Krause, Intel (PE)
- Reference #6
  - "Batch Compute: Machine Performance & Availability", Rick Ferreri, Intel (SrPE)