# **ASPLOS 2025**

# EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation





Weigao Su, Vishal Shrivastav

### **PURDUE**

## Why memory disaggregation?

- The need for memory is surging
- Constraints of individual servers
- Fine-grained pooling, elastic scaling

# Why is Ethernet promising?

- Dominant datacenter network fabric
- High bandwidth (Terabit Ethernet link)
- Low management cost, distance scaling...








# However, the latency in Ethernet is prohibitive, prompting proposals of separate fabric to carry memory traffic

Custom processor interconnect, PCIe, Infiniband, etc.



[Figure: title pages of related work, including Scale-Out NUMA, Pond (CXL-based memory pooling), and LegoOS]

#### But separate fabrics for different traffic make the network **costly** and **harder to manage**






## A low latency Ethernet fabric would allow us to have a single unified network fabric to carry all kinds of traffic (memory, storage, IP, ...)



## ... easier to manage, lower cost, statistical bandwidth multiplexing




# Research goal

# Achieving near intra-server memory access latency over rack-scale Ethernet

(while maintaining high bandwidth utilization)





# Memory Disaggregation over Ethernet










### Latency Overheads of Existing Memory Disaggregation over Ethernet

An example of a remote read request over RDMA







### 1. Ethernet MAC enforces a minimum 64B frame

... but memory messages can be much smaller (e.g., read requests are typically 8-16B), so they are padded with 0s

**Bandwidth wastage**








### 2. Ethernet MAC enforces minimum of 12 bytes **Inter-frame gap (IFG)**

... high overhead for small memory messages
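To make the first two overheads concrete, here is a rough back-of-the-envelope sketch. The 64B minimum frame and 12B IFG come from the slides; the 8B preamble is an assumption based on standard Ethernet framing. The sketch treats the memory message as the only useful content of the frame and ignores Ethernet, IP, and transport headers, which would lower the useful fraction further.

```python
MIN_FRAME_BYTES = 64  # minimum Ethernet frame size (from the slides)
IFG_BYTES = 12        # minimum inter-frame gap (from the slides)
PREAMBLE_BYTES = 8    # assumption: standard 7B preamble + 1B start-of-frame delimiter


def wire_efficiency(message_bytes: int) -> float:
    """Fraction of bytes on the wire that carry the memory message itself."""
    # Messages shorter than the minimum frame are zero-padded up to 64B.
    frame_bytes = max(message_bytes, MIN_FRAME_BYTES)
    on_wire_bytes = frame_bytes + IFG_BYTES + PREAMBLE_BYTES
    return message_bytes / on_wire_bytes


if __name__ == "__main__":
    for size in (8, 16, 64):
        print(f"{size:>2}B message: {wire_efficiency(size):.1%} of wire bytes are useful")
```

Under these assumptions an 8B read request occupies 84B of link time, so most of the bandwidth it consumes is padding and gap.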












### 3. Ethernet MAC does not allow **intra-frame preemption**

... a large non-memory frame may block the transmission of a small memory message

[Figure: a memory message arrives while an IP frame (e.g., 9KB) is being transmitted; the frame is non-preemptable for $\approx 744ns$ @ 100G]
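The blocking time is simply the serialization delay of the frame already on the wire. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the exact on-wire size behind the slide's figure is not stated, so the byte counts below are illustrative.

```python
def serialization_delay_ns(frame_bytes: int, link_gbps: float) -> float:
    """Time to put frame_bytes on a link of link_gbps, in nanoseconds.

    bits / (Gbit/s) yields nanoseconds directly.
    """
    return frame_bytes * 8 / link_gbps


if __name__ == "__main__":
    # A 9000B payload alone takes 720 ns at 100G; the slide's ~744 ns
    # corresponds to roughly 9300B on the wire (an inference, not a stated value).
    for size in (9000, 9300):
        print(f"{size}B @ 100G: {serialization_delay_ns(size, 100):.0f} ns")
```

In the worst case a memory message that arrives just after such a frame starts must wait for the whole frame, which is already more than the ~300ns end-to-end latency EDM reports.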






## **Root cause: MAC layer processing**





# **Design Choice #1:**

# Implement the entire protocol for remote memory access within Ethernet's Physical layer



## **Rationale for Remote Memory Protocol in PHY**



### Ethernet PHY already reformats a MAC layer frame into a series of 66-bit PHY blocks

- ... thus, unlike the MAC layer that works at a **frame** granularity, PHY works at fine-grained **block** granularity
  - PHY also has access to IFG blocks
  - 66 bit PHY block vs. 64 byte minimum MAC frame size
  - Message interleaving can be done at block granularity in PHY rather than at frame granularity in MAC
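The granularity gap can be sketched numerically. This assumes standard 64b/66b encoding, where each 66-bit block carries 64 payload bits plus a 2-bit sync header, and it ignores any per-block header or control fields that EDM itself may add, since the slides do not specify them.

```python
import math

BLOCK_BITS = 66          # PHY block size (from the slides)
PAYLOAD_BITS = 64        # assumption: 64b/66b encoding, 2 bits are the sync header
MIN_FRAME_BYTES = 64     # minimum MAC frame size (from the slides)


def phy_blocks(message_bytes: int) -> int:
    """Number of 66-bit blocks needed to carry the message payload."""
    return math.ceil(message_bytes * 8 / PAYLOAD_BITS)


def phy_wire_bits(message_bytes: int) -> int:
    """Bits on the wire if the message is carried directly in PHY blocks."""
    return phy_blocks(message_bytes) * BLOCK_BITS


def mac_frame_bits(message_bytes: int) -> int:
    """Bits handed to the PHY if the message is padded to a minimum MAC frame."""
    return max(message_bytes, MIN_FRAME_BYTES) * 8


if __name__ == "__main__":
    for size in (8, 16):
        print(f"{size}B: {phy_wire_bits(size)} bits in PHY blocks "
              f"vs {mac_frame_bits(size)} bits as a MAC frame")
```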







### Architecture of Remote Memory Protocol in the PHY










## **Benefits of Remote Memory Protocol in the PHY**



#### 1. Ethernet PHY operates at a fine data granularity of 66-bit PHY blocks.

Avoids bandwidth wastage for small memory messages.






#### 2. Ethernet PHY has access to IFG bits.

Can repurpose IFG to carry memory messages.

#### 3. Ethernet PHY enables intra-frame preemption.

Avoids blocking of a small memory message by a large non-memory frame.









### **Remote Memory Protocol in the PHY: What about latency?**





# **Design Choice #2:**

# Implement a centralized memory traffic scheduler in the PHY of the switch



## **Central Scheduler in the Switch PHY**





# **Overview of Central Scheduler**

- Step 1: Nodes send their memory message demands {src->dst} to the switch scheduler
- Step 2: Switch scheduler creates virtual circuits by forming a matching based on demand
  - Naive maximal matching ~O(N); EDM uses Parallel Iterative Matching (PIM) [1] ~O(log(N))
- Step 3: Nodes exchange memory messages over established circuits

[1] Anderson et al. "High Speed switch scheduling for Local Area Networks". TOCS 1993.
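As an illustration of the matching step, here is a minimal software sketch of the request/grant/accept rounds of Parallel Iterative Matching. It is not EDM's hardware pipeline: the function name, the data layout, and the seeded random choices are all assumptions made for readability.

```python
import random


def pim_match(demands, iterations=4, seed=0):
    """Match sources to destinations from a set of (src, dst) demands.

    Returns a dict {src: dst} in which every src and every dst appears at most once.
    """
    rng = random.Random(seed)
    match = {}            # src -> dst
    matched_dst = set()
    for _ in range(iterations):
        # Request: each unmatched src requests every unmatched dst it has demand for.
        requests = {}
        for src, dst in sorted(demands):
            if src not in match and dst not in matched_dst:
                requests.setdefault(dst, []).append(src)
        if not requests:
            break  # no unmatched pair is left, so the matching is maximal
        # Grant: each dst grants one of its requesting srcs at random.
        grants = {}
        for dst, srcs in requests.items():
            grants.setdefault(rng.choice(srcs), []).append(dst)
        # Accept: each src accepts one of the grants it received at random.
        for src, dsts in grants.items():
            dst = rng.choice(dsts)
            match[src] = dst
            matched_dst.add(dst)
    return match


if __name__ == "__main__":
    demands = {(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)}
    print(pim_match(demands))
```

Every round with at least one outstanding request adds at least one matched pair, so running as many rounds as there are nodes always yields a maximal matching; the point of PIM is that far fewer rounds are usually enough.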





# **Practical Central Scheduler**

- Challenge 1: Low latency for memory messages under bandwidth contention
- Challenge 2: Accurate, low overhead memory traffic demand estimation
- Challenge 3: Line rate, low latency scheduling hardware pipeline



## Challenge # 1: Achieve low latency under bandwidth contention

**Solution:** Augment PIM with priority scheduling

- First Come First Serve (FCFS) for <u>light-tailed</u> traffic distribution
- Shortest Remaining Processing Time (SRPT) for <u>heavy-tailed</u> traffic distribution






- For SRPT, each demand message from the nodes also contains the size of the message
- Demand messages per node are processed in increasing order of remaining bytes
- Matching contention -> prioritize demand messages with smaller remaining bytes
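The two priority policies differ only in how a contended destination picks among its requesters. A minimal sketch, assuming each request is a hypothetical `(src, remaining_bytes)` tuple listed in arrival order and that ties are broken in favor of the earlier arrival (the slides do not specify tie-breaking):

```python
def grant_fcfs(requests):
    """First Come First Serve: grant the earliest-arrived request."""
    return requests[0][0]


def grant_srpt(requests):
    """SRPT: grant the request with the fewest remaining bytes.

    min() returns the first minimal element, so ties go to the earlier arrival.
    """
    return min(requests, key=lambda req: req[1])[0]


if __name__ == "__main__":
    # Three sources contend for the same destination, listed in arrival order.
    contenders = [("A", 4096), ("B", 64), ("C", 64)]
    print("FCFS grants:", grant_fcfs(contenders))
    print("SRPT grants:", grant_srpt(contenders))
```

With a heavy-tailed mix, SRPT keeps the small 64B message from waiting behind the 4KB transfer, which is the case the slide highlights.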







## Challenge # 2: Acquire accurate, low overhead memory traffic demand matrix

**Solution:** Leverage the nature of the memory access interface, which specifies the amount of data to be read or written

- For reads, the read request implicitly contains the demand for the read reply
  - Zero bandwidth and latency overhead
- For writes, send an explicit demand message to the switch
  - Small bandwidth overhead (notifications are small)
  - Latency (~RTT/2) is small within a rack
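The read/write asymmetry can be sketched as a small function that derives a demand entry from a memory operation. The `MemOp` type, its field names, and the return layout are hypothetical and chosen only for illustration; the slides do not describe EDM's actual message formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    """Hypothetical description of one remote memory operation."""
    kind: str      # "read" or "write"
    compute: str   # node issuing the operation
    memory: str    # node holding the remote memory
    length: int    # bytes to read or write, as given by the memory interface


def demand_entry(op: MemOp):
    """Return ((src, dst), nbytes, needs_explicit_demand_message)."""
    if op.kind == "read":
        # The bulk data is the read reply, flowing memory -> compute.
        # The read request already states the length, so no extra message is needed.
        return (op.memory, op.compute), op.length, False
    if op.kind == "write":
        # The bulk data flows compute -> memory, so the sender must announce it first.
        return (op.compute, op.memory), op.length, True
    raise ValueError(f"unknown operation kind: {op.kind!r}")


if __name__ == "__main__":
    print(demand_entry(MemOp("read", "cpu0", "mem3", 64)))
    print(demand_entry(MemOp("write", "cpu0", "mem3", 4096)))
```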

## Challenge # 3: Design line-rate, low latency scheduling hardware pipeline

- Naive implementation of priority-based PIM would take O(log(N)) cycles per PIM iteration (N: number of demand messages)
- **Solution:** Leverage hardware parallelism to intelligently trade off hardware resources for time
  - Use a combination of a constant-time ordered list data structure with a fast priority encoder to implement priority-based PIM

**EDM can implement each iteration of PIM in exactly 3 clock cycles**
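For readers unfamiliar with the term, a priority encoder takes a bit vector of pending requests and returns the index of the highest-priority set bit. In hardware this is combinational logic; the sketch below is only a software analogue that shows what the block computes. The slides do not detail the ordered list or the 3-cycle pipeline, so neither is modeled here, and treating the lowest index as the highest priority is an assumption.

```python
def priority_encode(request_bits: int) -> int:
    """Return the index of the lowest set bit, or -1 if no request is pending."""
    if request_bits == 0:
        return -1
    # x & -x isolates the lowest set bit; bit_length() - 1 converts it to an index.
    return (request_bits & -request_bits).bit_length() - 1


if __name__ == "__main__":
    # Bits 2 and 4 are set; the encoder picks index 2.
    print(priority_encode(0b10100))
```

If the request vector is laid out so that bit positions follow priority order (for example, smallest remaining bytes first), one encode step selects the winner without scanning the requests one by one.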









# Implementation



### **Hardware Testbed**

- Three Xilinx Alveo U200 FPGAs
- Open-source 25GbE (Corundum)
- Synopsys ASIC RTL compiler

### **Network Simulator**

- A single rack with 144 nodes
- Fed with real-world traces
- Compared against 6 classes of scheduling / congestion control







### **Evaluation**

End-to-end unloaded latency





### **Evaluation**

Disaggregated workloads in a loaded network





# Summary

- EDM is a low latency Ethernet fabric for memory disaggregation.
- EDM uses two ideas for low latency w/ high bandwidth utilization:
  - EDM implements the protocol for remote memory access entirely in the **Ethernet PHY**.
  - EDM implements a fast, centralized memory traffic scheduler in the switch's PHY.
- EDM incurs a latency of ~300ns (7x lower than RoCE) in an unloaded network, and < 1.3x its unloaded latency under heavy network loads.





# Thank you!

Code: https://github.com/wegul/EDM

