fi_cxi(7) Libfabric Programmer's Manual
NAME
fi_cxi - The CXI Fabric Provider
OVERVIEW
The CXI provider enables libfabric on Cray’s Slingshot network. Slingshot is composed of the Rosetta switch and the Cassini NIC. Slingshot is an Ethernet-compliant network; however, the provider takes advantage of proprietary extensions to support HPC applications.
The CXI provider supports reliable, connection-less endpoint semantics. It supports two-sided messaging interfaces with message matching offloaded by the Cassini NIC. It also supports one-sided RMA and AMO interfaces, light-weight counting events, triggered operations (via the deferred work API), and fabric-accelerated small reductions.
REQUIREMENTS
The CXI Provider requires Cassini’s optimized HPC protocol which is only supported in combination with the Rosetta switch.
The provider uses the libCXI library for control operations and a set of Cassini-specific header files to enable direct hardware access in the data path.
SUPPORTED FEATURES
The CXI provider supports a subset of OFI features.
Endpoint types
The provider supports the FI_EP_RDM endpoint type.
Memory registration modes
The provider implements scalable memory registration. The provider requires FI_MR_ENDPOINT. FI_MR_ALLOCATED is required if ODP is not enabled or not desired. Client-specified 32-bit MR keys are the default unless FI_MR_PROV_KEY is specified. With FI_MR_PROV_KEY, provider-generated 64-bit MR keys are used. An RMA initiator can work concurrently with client- and provider-generated keys.
In client/server environments where stale MR key usage is a concern, FI_MR_PROV_KEY generated keys should be used along with FI_CXI_MR_MATCH_EVENTS=1 and FI_CXI_OPTIMIZED_MRS=0. The former speeds up MR close and enables non-remote cached MR keys, providing full remote memory access protection after an MR is closed, even if that memory remains in the libfabric MR cache. The latter uses only standard MRs, which use matching to enable robust key usage, protecting against a stale MR key matching a newly generated MR key.
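As an illustration, the following minimal sketch requests these MR modes through fi_getinfo() hints. The API version and the "cxi" provider name string are assumptions chosen for the example.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

int main(void)
{
    struct fi_info *hints, *info;
    int ret;

    hints = fi_allocinfo();
    if (!hints)
        return EXIT_FAILURE;

    hints->fabric_attr->prov_name = strdup("cxi");
    /* FI_MR_ENDPOINT is required. FI_MR_ALLOCATED is required unless
     * ODP is enabled. FI_MR_PROV_KEY selects 64-bit provider keys. */
    hints->domain_attr->mr_mode = FI_MR_ENDPOINT | FI_MR_ALLOCATED |
                                  FI_MR_PROV_KEY;

    ret = fi_getinfo(FI_VERSION(1, 15), NULL, NULL, 0, hints, &info);
    if (ret) {
        fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
        fi_freeinfo(hints);
        return EXIT_FAILURE;
    }

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return EXIT_SUCCESS;
}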
Data transfer operations
The following data transfer interfaces are supported: FI_ATOMIC, FI_MSG, FI_RMA, FI_TAGGED. See DATA TRANSFER OPERATIONS below for more details.
Completion events
The CXI provider supports all CQ event formats.
Modes
The CXI provider does not require any operation modes.
Progress
The CXI provider currently supports FI_PROGRESS_MANUAL data and control progress modes.
Multi-threading
The CXI provider supports FI_THREAD_SAFE and FI_THREAD_DOMAIN threading models.
Wait Objects
The CXI provider supports FI_WAIT_FD and FI_WAIT_POLLFD CQ wait object types. FI_WAIT_UNSPEC will default to FI_WAIT_FD. However, FI_WAIT_NONE achieves the lowest latency and avoids interrupt overhead.
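Where interrupt-driven progress is wanted, a CQ can be opened with FI_WAIT_FD and its file descriptor retrieved with the standard fi_control(FI_GETWAIT) call for use with poll() or epoll(). A minimal sketch, with error handling reduced:

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_eq.h>

static int open_cq_with_fd(struct fid_domain *domain, struct fid_cq **cq,
                           int *fd)
{
    struct fi_cq_attr attr = {
        .format = FI_CQ_FORMAT_TAGGED,
        .wait_obj = FI_WAIT_FD,
    };
    int ret;

    ret = fi_cq_open(domain, &attr, cq, NULL);
    if (ret)
        return ret;

    /* The returned fd becomes readable when completions may be
     * available and can be handed to poll()/epoll(). */
    return fi_control(&(*cq)->fid, FI_GETWAIT, fd);
}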
Additional Features
The CXI provider also supports the following capabilities and features:
- FI_MULTI_RECV
- FI_SOURCE
- FI_NAMED_RX_CTX
- FI_RM_ENABLED
- FI_RMA_EVENT
- FI_REMOTE_CQ_DATA
- FI_MORE
- FI_FENCE
Addressing Format
The CXI provider uses a proprietary address format. This format includes fields for NIC Address and PID. NIC Address is the topological address of the NIC endpoint on the fabric. All OFI Endpoints sharing a Domain share the same NIC Address. PID (Port ID or Process ID, adopted from the Portals 4 specification) is analogous to an IP socket port number. Valid PIDs are in the range [0-510].
A third component of Slingshot network addressing is the Virtual Network ID (VNI). VNI is a protection key used by the Slingshot network to provide isolation between applications. A VNI defines an isolated PID space for a given NIC. Therefore, Endpoints must use the same VNI in order to communicate. Note that VNI is not a field of the CXI address, but rather is specified as part of the OFI Endpoint auth_key. The combination of NIC Address, VNI, and PID is unique to a single OFI Endpoint within a Slingshot fabric.
The NIC Address of an OFI Endpoint is inherited from the Domain. By default, a PID is automatically assigned to an Endpoint when it is enabled. The address of an Endpoint can be queried using fi_getname. The address received from fi_getname may then be inserted into a peer’s Address Vector. The resulting FI address may then be used to perform an RDMA operation.
Alternatively, a client may manage PID assignment. fi_getinfo may be used to create an fi_info structure that can be used to create an Endpoint with a client-specified address. To achieve this, use fi_getinfo with the FI_SOURCE flag set and set node and service strings to represent the local NIC interface and PID to be assigned to the Endpoint. The NIC interface string should match the name of an available CXI domain (in the format cxi[0-9]). The PID string will be interpreted as a 9-bit integer. Address conflicts will be detected when the Endpoint is enabled.
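For example, the following sketch requests a client-managed address; the device name "cxi0" and PID value 120 are placeholders.

#include <rdma/fabric.h>
#include <rdma/fi_errno.h>

static int get_cxi_info_with_pid(struct fi_info **info)
{
    struct fi_info *hints = fi_allocinfo();
    int ret;

    if (!hints)
        return -FI_ENOMEM;

    /* node selects the local CXI interface; service is the PID,
     * interpreted as a 9-bit integer. */
    ret = fi_getinfo(FI_VERSION(1, 15), "cxi0", "120", FI_SOURCE,
                     hints, info);
    fi_freeinfo(hints);
    return ret;
}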
Authorization Keys
The CXI authorization key format is defined by struct cxi_auth_key. This structure is defined in fi_cxi_ext.h.
struct cxi_auth_key {
uint32_t svc_id;
uint16_t vni;
};
The CXI authorization key format includes a VNI and CXI service ID. VNI is a component of the CXI Endpoint address that provides isolation. A CXI service is a software container which defines a set of local CXI resources, VNIs, and Traffic Classes which a libfabric user can access.
Two endpoints must use the same VNI in order to communicate. Generally, a parallel application should be assigned to a unique VNI on the fabric in order to achieve network traffic and address isolation. Typically a privileged entity, like a job launcher, will allocate one or more VNIs for use by the libfabric user.
The CXI service API is provided by libCXI. It enables a privileged entity, like an application launcher, to control an unprivileged process’s access to NIC resources. Generally, a parallel application should be assigned to a unique CXI service in order to control access to local resources, VNIs, and Traffic Classes.
While a libfabric user-provided authorization key is optional, it is highly encouraged that libfabric users provide an authorization key through the domain attribute hints during fi_getinfo(). How libfabric users acquire the authorization key may vary between users and is outside the scope of this document.
If an authorization key is not provided by the libfabric user, the CXI provider will attempt to generate an authorization key on behalf of the user. The following outlines how the CXI provider will attempt to generate an authorization key.
- Query the following environment variables and generate an authorization key from them:
  - SLINGSHOT_VNIS: Comma-separated list of VNIs. The CXI provider will only use the first VNI if multiple are provided. Example: SLINGSHOT_VNIS=234.
  - SLINGSHOT_DEVICES: Comma-separated list of device names. Each device index is used to look up the service ID at the same index in SLINGSHOT_SVC_IDS. Example: SLINGSHOT_DEVICES=cxi0,cxi1.
  - SLINGSHOT_SVC_IDS: Comma-separated list of pre-configured CXI service IDs. Each service ID index is used to look up the CXI device at the same index in SLINGSHOT_DEVICES. Example: SLINGSHOT_SVC_IDS=5,6.
Note: How valid VNIs and device services are configured is outside the responsibility of the CXI provider.
- Query pre-configured device services and find the first entry with the same UID as the libfabric user.
- Query pre-configured device services and find the first entry with the same GID as the libfabric user.
- Query pre-configured device services and find the first entry which does not restrict member access. If enabled, the default service is an example of an unrestricted service.
Note: There is a security concern with such services since they allow multiple independent libfabric users to use the same service.
Note: For entries 2-4 above, it is possible that the device service found does not restrict VNI access. In such cases, the CXI provider will query FI_CXI_DEFAULT_VNI to assign a VNI.
During Domain allocation, if the domain auth_key attribute is NULL, the CXI provider will attempt to generate a valid authorization key. If the domain auth_key attribute is valid (i.e. not NULL and the encoded authorization key has been verified), the CXI provider will use the encoded VNI and service ID. Failure to generate a valid authorization key will result in Domain allocation failure.
During Endpoint allocation, if the endpoint auth_key attribute is NULL, the Endpoint will inherit the parent Domain’s VNI and service ID. If the Endpoint auth_key attribute is valid, the encoded VNI and service ID must match the parent Domain’s VNI and service ID. Allocating an Endpoint with a different VNI and service ID from the parent Domain is not supported.
The following is the expected parallel application launch workflow with CXI integrated launcher and CXI authorization key aware libfabric user:
- A parallel application is launched.
- The launcher allocates one or more VNIs for use by the application.
- The launcher communicates with compute node daemons where the application will be run.
- The launcher compute node daemon configures local CXI interfaces. libCXI is used to allocate one or more services for the application. The service will define the local resources, VNIs, and Traffic Classes that the application may access. Service allocation policies must be defined by the launcher. libCXI returns an ID to represent a service.
- The launcher forks application processes.
- The launcher provides one or more service IDs and VNI values to the application processes.
- Application processes select from the list of available service IDs and VNIs to form an authorization key to use for Endpoint allocation.
Endpoint Protocols
The provider supports multiple endpoint protocols. The default protocol is FI_PROTO_CXI, which fully supports the messaging requirements of parallel applications.
The FI_PROTO_CXI_RNR endpoint protocol is an optional protocol that targets client/server environments where send-after-send ordering is not required and messaging is generally to pre-posted buffers; FI_MULTI_RECV is recommended. It utilizes a receiver-not-ready implementation where FI_CXI_RNR_MAX_TIMEOUT_US can be tuned to control the maximum retry duration.
Address Vectors
The CXI provider supports both FI_AV_TABLE and FI_AV_MAP with the same internal implementation.
The CXI provider uses the FI_SYMMETRIC AV flag for optimization. When used with FI_AV_TABLE, the CXI provider can use the fi_addr_t index as an endpoint identifier instead of a network address. The benefit is that, when running with FI_SOURCE, a reverse lookup is not needed to generate the source fi_addr_t for target CQ events. Note: FI_SOURCE_ERR should not be used with this configuration.
If the AV is not configured with FI_SYMMETRIC, FI_AV_USER_ID is supported as a flag which can be passed into AV insert.
Since scalable EPs are not supported, fi_av_attr::rx_ctx_bits must be zero.
The following AV capabilities and flags are not supported: FI_SHARED_AV, FI_SYNC_ERR, FI_EVENT, and FI_READ.
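A sketch of opening an AV configured for the optimization described above; the count is illustrative.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

static int open_symmetric_av(struct fid_domain *domain, struct fid_av **av,
                             size_t count)
{
    struct fi_av_attr attr = {
        .type = FI_AV_TABLE,
        .count = count,      /* expected number of peers (illustrative) */
        .flags = FI_SYMMETRIC,
        .rx_ctx_bits = 0,    /* scalable EPs unsupported: must be zero */
    };

    return fi_av_open(domain, &attr, av, NULL);
}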
Operation flags
The CXI provider supports the following Operation flags:
- FI_MORE
- When FI_MORE is specified in a data transfer operation, the provider will defer submission of RDMA commands to hardware. When one or more data transfer operations are performed using FI_MORE, followed by an operation without FI_MORE, the provider will submit the entire batch of queued operations to hardware using a single PCIe transaction, improving PCIe efficiency.
When FI_MORE is used, queued commands will not be submitted to hardware until another data transfer operation is performed without FI_MORE; see the sketch after this list.
- FI_TRANSMIT_COMPLETE
- By default, all CXI provider completion events satisfy the requirements of the ‘transmit complete’ completion level. Transmit complete events are generated when the initiator receives an Ack from the target NIC. The Ack is generated once all data has been received by the target NIC. Transmit complete events do not guarantee that data is visible to the target process.
- FI_DELIVERY_COMPLETE
- When the ‘delivery complete’ completion level is used, the event guarantees that data is visible to the target process. To support this, hardware at the target performs a zero-byte read operation to flush data across the PCIe bus before generating an Ack. Flushing reads are performed unconditionally and will lead to higher latency.
- FI_MATCH_COMPLETE
- When the ‘match complete’ completion level is used, the event guarantees that the message has been matched to a client-provided buffer. All messages longer than the eager threshold support this guarantee. When ‘match complete’ is used with a Send that is shorter than the eager threshold, an additional handshake may be performed by the provider to notify the initiator that the Send has been matched.
The CXI provider also supports the following operation flags:
- FI_INJECT
- FI_FENCE
- FI_COMPLETION
- FI_REMOTE_CQ_DATA
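The sketch below illustrates the FI_MORE batching pattern referenced above: all but the final write carry FI_MORE, and the final write flushes the queued commands to hardware. Buffer, descriptor, and address setup is assumed to happen elsewhere.

#include <rdma/fabric.h>
#include <rdma/fi_rma.h>

static ssize_t write_batch(struct fid_ep *ep, const struct fi_msg_rma *msgs,
                           size_t nmsgs)
{
    ssize_t ret;
    size_t i;

    for (i = 0; i < nmsgs; i++) {
        /* All but the final command remain queued in the provider. */
        uint64_t flags = (i + 1 < nmsgs) ? FI_MORE : 0;

        ret = fi_writemsg(ep, &msgs[i], flags);
        if (ret)
            return ret;
    }
    return 0;
}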
Scalable Endpoints
Scalable Endpoints (SEPs) support is not enabled in the CXI provider. Future releases of the provider will re-introduce SEP support.
Messaging
The CXI provider supports both tagged (FI_TAGGED) and untagged (FI_MSG) two-sided messaging interfaces. In the normal case, message matching is performed by hardware. In certain low-resource conditions, the responsibility to perform message matching may be transferred to software. Specifying the receive message matching mode in the environment (FI_CXI_RX_MATCH_MODE) controls the initial matching mode and whether hardware matching can transparently transition to software, where a hybrid of hardware and software receive matching is performed.
If a Send operation arrives at a node where there is no matching Receive operation posted, it is considered unexpected. Unexpected messages are supported. The provider manages buffers to hold unexpected message data.
Unexpected message handling is transparent to clients. Despite that, clients should take care to avoid excessive use of unexpected messages by pre-posting Receive operations. An unexpected message ties up hardware and memory resources until it is matched with a user buffer.
The CXI provider implements several message protocols internally. A message protocol is selected based on payload length. Short messages are transferred using the eager protocol. In the eager protocol, the entire message payload is sent along with the message header. If an eager message arrives unexpectedly, the entire message is buffered at the target until it is matched to a Receive operation.
Long messages are transferred using a rendezvous protocol. The threshold at which the rendezvous protocol is used is controlled with the FI_CXI_RDZV_THRESHOLD and FI_CXI_RDZV_GET_MIN environment variables.
In the rendezvous protocol, a portion of the message payload is sent along with the message header. Once the header is matched to a Receive operation, the remainder of the payload is pulled from the source using an RDMA Get operation. If the message arrives unexpectedly, the eager portion of the payload is buffered at the target until it is matched to a Receive operation. In the normal case, the Get is performed by hardware and the operation completes without software progress.
Unexpected rendezvous protocol messages cannot complete and release source-side buffer resources until a matching receive is posted at the destination and the non-eager data is read from the source with a rendezvous get DMA. The number of rendezvous messages that may be outstanding is limited by the minimum of the hints->tx_attr->size value specified and the number of rendezvous operation ID mappings available. FI_TAGGED rendezvous messages have 32K-256 ID mappings; FI_MSG rendezvous messages are limited to 256 ID mappings. While this works well with MPI, care should be taken to ensure this minimum is large enough that applications written assuming unlimited resources, and using FI_MSG rendezvous messaging, do not induce a software deadlock. If FI_MSG rendezvous messaging is done in an unexpected manner that may exceed the available FI_MSG ID mappings, it may be sufficient to reduce the number of rendezvous operations by increasing the rendezvous threshold. See FI_CXI_RDZV_THRESHOLD for more information.
Message flow control is triggered when hardware message matching resources become exhausted. Messages may be dropped and retransmitted in order to recover, which impacts performance significantly. Programs should be careful to avoid posting large numbers of unmatched receive operations and to minimize the number of outstanding unexpected messages to prevent message flow control. If the RX message matching mode is configured to support hybrid mode, hardware will transition to hybrid operation when resources are exhausted, with hardware and software sharing matching responsibility.
To help avoid this condition, increase Overflow buffer space using environment variables FI_CXI_OFLOW_*, and for software and hybrid RX match modes increase Request buffer space using the variables FI_CXI_REQ_*.
Message Ordering
The CXI provider supports the following ordering rules:
- All message Send operations are always ordered.
- RMA Writes may be ordered by specifying FI_ORDER_RMA_WAW.
- AMOs may be ordered by specifying FI_ORDER_AMO_{WAW|WAR|RAW|RAR}.
- RMA Writes may be ordered with respect to AMOs by specifying FI_ORDER_WAW. Fetching AMOs may be used to perform short reads that are ordered with respect to RMA Writes.
Ordered RMA size limits are set as follows:
- max_order_waw_size is -1. RMA Writes and non-fetching AMOs of any size are ordered with respect to each other.
- max_order_raw_size is -1. Fetching AMOs of any size are ordered with respect to RMA Writes and non-fetching AMOs.
- max_order_war_size is -1. RMA Writes and non-fetching AMOs of any size are ordered with respect to fetching AMOs.
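Ordering must be requested through the endpoint message-ordering attributes. A sketch of hints requesting ordered RMA Writes and AMOs follows; FI_ORDER_ATOMIC_* are the standard libfabric bits corresponding to the AMO orderings named above.

#include <rdma/fabric.h>

static void request_ordered_rma(struct fi_info *hints)
{
    /* Order writes after writes, and AMOs in every combination. */
    hints->tx_attr->msg_order = FI_ORDER_RMA_WAW | FI_ORDER_WAW |
                                FI_ORDER_ATOMIC_WAW | FI_ORDER_ATOMIC_WAR |
                                FI_ORDER_ATOMIC_RAW | FI_ORDER_ATOMIC_RAR;
}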
PCIe Ordering
Generally, PCIe writes are strictly ordered. As an optimization, PCIe TLPs may have the Relaxed Order (RO) bit set to allow writes to be reordered. Cassini sets the RO bit in PCIe TLPs when possible. Cassini sets PCIe RO as follows:
- Ordering of messaging operations is established using completion events. Therefore, all PCIe TLPs related to two-sided message payloads will have RO set.
- Every PCIe TLP associated with an unordered RMA or AMO operation will have RO cleared.
- PCIe TLPs associated with the last packet of an ordered RMA or AMO operation will have RO cleared.
- PCIe TLPs associated with the body packets (all except the last packet of an operation) of an ordered RMA operation will have RO set.
Translation
The CXI provider supports two translation mechanisms: Address Translation Services (ATS) and NIC Translation Agent (NTA). Use the environment variable FI_CXI_ATS to select between translation mechanisms.
ATS refers to NIC support for PCIe rev. 4 ATS, PRI and PASID features. ATS enables the NIC to efficiently access the entire virtual address space of a process. ATS mode currently supports AMD hosts using the iommu_v2 API.
The NTA is an on-NIC translation unit. The NTA supports two-level page tables and additional hugepage sizes. Most CPUs support 2MB and 1GB hugepage sizes. Other hugepage sizes may be supported in software to enable the NIC to cache more address space.
ATS and NTA both support on-demand paging (ODP) in the event of a page fault. Use the environment variable FI_CXI_ODP to enable ODP.
With ODP enabled, buffers used for data transfers are not required to be backed by physical memory. An un-populated buffer that is referenced by the NIC will incur a network page fault. Network page faults significantly impact application performance. Clients should take care to pre-populate buffers used for data transfer operations to avoid network page faults. Copy-on-write semantics work as expected with ODP.
With ODP disabled, all buffers used for data transfers are backed by pinned physical memory. Using Pinned mode avoids any overhead due to network page faults but requires all buffers to be backed by physical memory. Copy-on-write semantics are broken when using pinned memory. See the Fork section for more information.
The CXI provider supports DMABUF for device memory registration. DMABUF is supported in ROCm 5.6+ and CUDA 11.7+ with the NVIDIA open source driver 525+. Both FI_HMEM_ROCR_USE_DMABUF and FI_HMEM_CUDA_USE_DMABUF are disabled by default in the libfabric core, but the CXI provider enables FI_HMEM_ROCR_USE_DMABUF by default if not explicitly set. There are situations where CUDA DMABUF may double BAR consumption; until this is fixed in the CUDA stack, CUDA DMABUF remains disabled by default.
Translation Cache
Mapping a buffer for use by the NIC is an expensive operation. To avoid this penalty for each data transfer operation, the CXI provider maintains an internal translation cache.
When using the ATS translation mode, the provider does not maintain translations for individual buffers. It follows that translation caching is not required.
Triggered Operation
The CXI provider supports triggered operations through the deferred work queue API. The following deferred work queue operations are supported: FI_OP_SEND, FI_OP_TSEND, FI_OP_READ, FI_OP_WRITE, FI_OP_ATOMIC, FI_OP_FETCH_ATOMIC, and FI_OP_COMPARE_ATOMIC. FI_OP_RECV and FI_OP_TRECV are also supported, but only with a threshold of zero.
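A minimal sketch of queuing a triggered RMA Write through the deferred work API; the endpoint, counter, and fi_op_rma contents are assumed to be set up elsewhere.

#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_trigger.h>

static int queue_triggered_write(struct fid_domain *domain, struct fid_ep *ep,
                                 struct fid_cntr *trig_cntr,
                                 struct fi_op_rma *rma, uint64_t threshold)
{
    struct fi_deferred_work work;

    memset(&work, 0, sizeof(work));
    work.op_type = FI_OP_WRITE;
    work.triggering_cntr = trig_cntr;
    work.threshold = threshold;   /* fires once the counter reaches this */
    work.op.rma = rma;
    rma->ep = ep;

    /* -FI_ENOSPC may be returned if triggered-op resources are exhausted
     * and FI_CXI_ENABLE_TRIG_OP_LIMIT is enabled. */
    return fi_control(&domain->fid, FI_QUEUE_WORK, &work);
}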
The CXI provider backs each triggered operation by hardware resources. Exhausting triggered operation resources leads to indeterminate behavior and should be prevented.
The CXI provider offers two methods to prevent triggered operation resource exhaustion.
Experimental FI_CXI_ENABLE_TRIG_OP_LIMIT Environment Variable
When FI_CXI_ENABLE_TRIG_OP_LIMIT is enabled, the CXI provider will use semaphores to coordinate triggered operation usage between threads and across processes using the same service ID. When triggered operation resources are exhausted, fi_control(FI_QUEUE_WORK) will return -FI_ENOSPC. It is up to the libfabric user to recover from this situation.
Note: Preventing triggered operation resource exhaustion with this method may be expensive and result in a negative performance impact. Libfabric users are encouraged to avoid this method unless absolutely needed. By default, FI_CXI_ENABLE_TRIG_OP_LIMIT is disabled.
Note: Named semaphores are used to coordinate triggered operation resource usage across multiple processes. System/node software may need to be implemented to ensure all semaphores are unlinked during unexpected application termination.
Note: This feature is considered experimental and the implementation may be subject to change.
CXI Domain get_dwq_depth Extension
The CXI domain get_dwq_depth extension returns the deferred work queue depth (i.e. the number of triggered operation resources assigned to the service ID used by the fi_domain). Libfabric users can use the returned queue depth to coordinate resource usage.
For example, suppose the job launcher has configured a service ID with 512 triggered operation resources. Since the CXI provider needs to consume 8 per service ID, 504 should be usable by libfabric users. If the libfabric user knows there are N processes using a given service ID and NIC, it can divide the 504 triggered operation resources among all N processes.
Note: This is the preferred method to prevent triggered operation resource exhaustion since it does not introduce semaphores into the fi_control(FI_QUEUE_WORK) critical path.
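A hedged sketch of using the extension is shown below. The ops name FI_CXI_DOM_OPS_1 and the get_dwq_depth() signature are assumptions made for illustration; consult fi_cxi_ext.h for the authoritative definitions.

#include <rdma/fabric.h>
#include <rdma/fi_cxi_ext.h>

static int query_dwq_depth(struct fid_domain *domain, size_t *depth)
{
    struct fi_cxi_dom_ops *ops;
    int ret;

    /* Open the CXI domain extension ops (name is an assumption). */
    ret = fi_open_ops(&domain->fid, FI_CXI_DOM_OPS_1, 0,
                      (void **)&ops, NULL);
    if (ret)
        return ret;

    /* Signature is an assumption; see fi_cxi_ext.h. */
    return ops->get_dwq_depth(&domain->fid, depth);
}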
Fork Support
The following subsections outline the CXI provider fork support.
RDMA and Fork Overview
Under Linux, fork() is implemented using copy-on-write (COW) pages, so the only penalty it incurs is the time and memory required to duplicate the parent’s page tables, mark all of the process’s page structs as read-only and COW, and create a unique task structure for the child.
Due to the Linux COW fork policy, both parent and child processes’ virtual addresses are mapped to the same physical address. The first process to write to the virtual address will get a new physical page, and thus a new physical address, with the same content as the previous physical page.
The Linux COW fork policy is problematic for RDMA NICs. RDMA NICs require memory to be registered with the NIC prior to executing any RDMA operations. In user-space, memory registration results in establishing a virtual address to physical address mapping with the RDMA NIC. This resulting RDMA NIC mapping/memory region does not get updated when the Linux COW fork policy is executed.
Consider the following example:
- Process A is planning to perform RDMA with virtual address 0xffff0000 and a size of 4096. This virtual address maps to physical address 0x1000.
- Process A registers this virtual address range with the RDMA NIC. The RDMA NIC device driver programs its page tables to establish the virtual address 0xffff0000 to physical address 0x1000 mapping.
- Process A decides to fork Process B. Virtual address 0xffff0000 will now be subjected to COW.
- Process A decides to write to virtual address 0xffff0000 before doing the RDMA operation. This triggers the Linux COW fork policy, resulting in the following:
- Process A: Virtual address 0xffff0000 maps to new physical address 0x2000
- Process B: Virtual address 0xffff0000 maps to previous physical address 0x1000
- Process A now executes an RDMA operation using the mapping/memory region associated with virtual address 0xffff0000. Since COW occurred, the RDMA NIC executes the RDMA operation using physical address 0x1000 which belongs to Process B. This results in data corruption.
The crux of the issue is the parent issuing forks while trying to do RDMA operations to registered memory regions. Excluding software RDMA emulation, two options exist for RDMA NIC vendors to resolve this data corruption issue.
- Linux madvise() MADV_DONTFORK and MADV_DOFORK
- RDMA NIC support for on-demand paging (ODP)
Linux madvise() MADV_DONTFORK and MADV_DOFORK
The generic (i.e. non-vendor-specific) RDMA NIC solution to the Linux COW fork policy and RDMA problem is to use the following madvise() operations during memory registration and deregistration:
- MADV_DONTFORK: Do not make the pages in this range available to the child after a fork(). This is useful to prevent copy-on-write semantics from changing the physical location of a page if the parent writes to it after a fork(). (Such page relocations cause problems for hardware that DMAs into the page.)
- MADV_DOFORK: Undo the effect of MADV_DONTFORK, restoring the default behavior, whereby a mapping is inherited across fork().
In the Linux kernel, MADV_DONTFORK will result in the virtual memory area struct (VMA) being marked with the VM_DONTCOPY flag. VM_DONTCOPY signals to the Linux kernel to not duplicate this VMA on fork. This effectively leaves a hole in child process address space. Should the child reference the virtual address corresponding to the VMA which was not duplicated, it will segfault.
In the previous example, if Process A issued madvise(0xffff0000, 4096, MADV_DONTFORK) before performing RDMA memory registration, the physical address 0x1000 would have remained with Process A. This would prevent the Process A data corruption as well. If Process B were to reference virtual address 0xffff0000, it would segfault due to the hole in the virtual address space.
Using madvise() with MADV_DONTFORK may be problematic for applications performing RDMA and page aliasing. Page aliasing is where the parent process uses part or all of a page to share information with the child process. If RDMA is also being used for a separate portion of this page, the child process will segfault when an access causes page aliasing.
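For reference, the madvise() calls described above reduce to the following sketch, applied at registration and deregistration of a page-aligned buffer:

#include <sys/mman.h>

static int mark_dontfork(void *buf, size_t len)
{
    /* buf and len must be page aligned/sized; pages in this range are
     * withheld from the child across fork(). */
    return madvise(buf, len, MADV_DONTFORK);
}

static int unmark_dontfork(void *buf, size_t len)
{
    /* Restore default inheritance across fork() at deregistration. */
    return madvise(buf, len, MADV_DOFORK);
}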
RDMA NIC Support for ODP
An RDMA NIC vendor specific solution to the Linux COW fork policy and RDMA problem is to use ODP. ODP allows for the RDMA NIC to generate page requests for translations it does not have a physical address for. The following is an updated example with ODP:
- Process A is planning to perform RDMA with virtual address 0xffff0000 and a size of 4096. This virtual address maps to physical address 0x1000.
- Process A registers this virtual address range with the RDMA NIC. The RDMA NIC device driver may optionally program its page tables to establish the virtual address 0xffff0000 to physical address 0x1000 mapping.
- Process A decides to fork Process B. Virtual address 0xffff0000 will now be subjected to COW.
- Process A decides to write to virtual address 0xffff0000 before doing the RDMA operation. This triggers the Linux COW fork policy, resulting in the following:
- Process A: Virtual address 0xffff0000 maps to new physical address 0x2000
- Process B: Virtual address 0xffff0000 maps to previous physical address 0x1000
- RDMA NIC device driver: Receives an MMU invalidation event for Process A virtual address range 0xffff0000 through 0xffff0fff. The device driver updates the corresponding memory region to no longer reference physical address 0x1000.
- Process A now executes an RDMA operation using the memory region associated with 0xffff0000. The RDMA NIC will recognize the corresponding memory region as no longer having a valid physical address. The RDMA NIC will then signal to the device driver to fault in the corresponding address, if necessary, and update the physical address associated with the memory region. In this case, the memory region will be updated with physical address 0x2000. Once completed, the device driver signals to the RDMA NIC to continue the RDMA operation. Data corruption does not occur since RDMA occurred to the correct physical address.
CXI Provider Fork Support
The CXI provider is subjected to the Linux COW fork policy and RDMA issues described in section RDMA and Fork Overview. To prevent data corruption with fork, the CXI provider supports the following options:
- CXI-specific fork environment variables to enable madvise() MADV_DONTFORK and MADV_DOFORK
- ODP support*
*Formal ODP support pending.
CXI Specific Fork Environment Variables
The CXI software stack has two environment variables related to fork:
- CXI_FORK_SAFE: Enables base fork-safe support. With this environment variable set, regardless of value, libcxi will issue madvise() with MADV_DONTFORK on the virtual address range being registered for RDMA. In addition, libcxi always aligns the madvise() to the system default page size. On x86, this is 4 KiB. To prevent redundant madvise() calls with MADV_DONTFORK against the same virtual address region, reference counting is used against each tracked madvise() region. In addition, libcxi will split and merge tracked madvise() regions if needed. Once the reference count reaches zero, libcxi will call madvise() with MADV_DOFORK and no longer track the region.
- CXI_FORK_SAFE_HP: With this environment variable set, in conjunction with CXI_FORK_SAFE, libcxi will not assume the page size is the system default page size. Instead, libcxi will walk /proc/<pid>/smaps to determine the correct page size and align the madvise() calls accordingly. This environment variable should be set if huge pages are being used for RDMA. To amortize the per-memory-registration walk of /proc/<pid>/smaps, the libfabric MR cache should be used.
Setting these environment variables will prevent data corruption when the parent issues a fork. But it may result in the child process experiencing a segfault if it references a virtual address being used for RDMA in the parent process.
ODP Support and Fork
CXI provider ODP support would allow applications to avoid setting CXI_FORK_SAFE and CXI_FORK_SAFE_HP to prevent parent-process data corruption. Enabling ODP to resolve the RDMA and fork issue may or may not result in a performance impact. The concern with ODP is that if the rate of invalidations and ODP page requests is relatively high and they occur at the same time, ODP timeouts may occur. This would result in libfabric data transfer operations failing.
Please refer to the CXI Provider ODP Support for more information on how to enable/disable ODP.
CXI Provider Fork Support Guidance
Since the CXI provider offloads the majority of the libfabric data transfer operations to the NIC, thus enabling end-to-end RDMA between libfabric user buffers, it is subjected to the issue described in section RDMA and Fork Overview. For comparison, software emulated RDMA libfabric providers may not have these issues since they rely on bounce buffers to facilitate data transfer.
The following is the CXI provider fork support guidance:
- Enable CXI_FORK_SAFE. If huge pages are also used, CXI_FORK_SAFE_HP should be enabled as well. Since enabling this will result in madvise() with MADV_DONTFORK, the following steps should be taken to prevent a child process segfault:
  - Avoid using stack memory for RDMA
  - Avoid the child process having to access a virtual address range the parent process is performing RDMA against
  - Use page-aligned heap allocations for RDMA
- Enable ODP and run without CXI_FORK_SAFE and CXI_FORK_SAFE_HP. The functionality and performance of ODP with fork may be application specific. Currently, ODP is not formally supported.
The CXI provider’s preferred approach is to use CXI_FORK_SAFE and CXI_FORK_SAFE_HP. While it may require the application to take certain precautions, it will result in a more portable application regardless of RDMA NIC.
Heterogeneous Memory (HMEM) Supported Interfaces
The CXI provider supports the following OFI iface types: FI_HMEM_CUDA, FI_HMEM_ROCR, and FI_HMEM_ZE.
FI_HMEM_ZE Limitations
The CXI provider only supports GPU direct RDMA with ZE device buffers if implicit scaling is disabled. The following ZE environment variables disable implicit scaling: EnableImplicitScaling=0 NEOReadDebugKeys=1.
For testing purposes only, the implicit scaling check can be disabled by setting the following environment variable: FI_CXI_FORCE_ZE_HMEM_SUPPORT=1. This may need to be combined with the following environment variable to get CXI provider memory registration to work: FI_CXI_DISABLE_HMEM_DEV_REGISTER=1.
Collectives (accelerated)
The CXI provider supports a limited set of collective operations specifically intended to support use of the hardware-accelerated reduction features of the CXI-supported NIC and fabric hardware.
These features are implemented using the (experimental) OFI collectives API. The implementation supports the following collective functions:
- fi_query_collective()
- fi_join_collective()
- fi_barrier()
- fi_broadcast()
- fi_reduce()
- fi_allreduce()
fi_query_collective()
Standard implementation that exposes the features described below.
fi_join_collective()
The fi_join_collective() implementation is provider-managed. However, the coll_addr parameter is not useful to the implementation, and must be specified as FI_ADDR_NOTAVAIL. The set parameter must contain fi_addr_t values that resolve to meaningful CXI addresses in the endpoint fi_av structure. fi_join_collective() must be called for every address in the set list, and must be progressed until the join operation is complete. There is no inherent limit on join concurrency.
The join will create a multicast tree in the fabric to manage the collective operations. This operation requires access to a secure Fabric Manager REST API that constructs this tree, so any application that attempts to use accelerated collectives will bind to libcurl and associated security libraries, which must be available on the system.
There are hard limits to the number of multicast addresses available on a system, and administrators may impose additional limits on the number of multicast addresses available to any given collective job.
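A minimal join sketch under the rules above; the av_set is assumed to already contain the fi_addr_t values of all group members, and the caller must progress the endpoint until the join event completes.

#include <rdma/fabric.h>
#include <rdma/fi_collective.h>

static int join_group(struct fid_ep *ep, const struct fid_av_set *set,
                      struct fid_mc **mc, void *context)
{
    /* coll_addr must be FI_ADDR_NOTAVAIL for this implementation. */
    return fi_join_collective(ep, FI_ADDR_NOTAVAIL, set, 0, mc, context);
}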
fi_reduction operations
Payloads are limited to 32-byte data structures, and because they all use the same underlying hardware model, they are all synchronizing calls. Specifically, the supported functions are all variants of fi_allreduce().
- fi_barrier is fi_allreduce using an optimized no-data operator.
- fi_broadcast is fi_allreduce using FI_BOR, with data forced to zero for all but the root rank.
- fi_reduce is fi_allreduce with a result pointer ignored by all but the root rank.
All functions must be progressed to completion on all ranks participating in the collective group. There is a hard limit of eight concurrent reductions on each collective group, and attempts to launch more operations will return -FI_EAGAIN.
fi_allreduce supports the following hardware-accelerated reduction operators:
| Operator | Supported Datatypes |
|---|---|
| FI_BOR | FI_UINT8, FI_UINT16, FI_UINT32, FI_UINT64 |
| FI_BAND | FI_UINT8, FI_UINT16, FI_UINT32, FI_UINT64 |
| FI_BXOR | FI_UINT8, FI_UINT16, FI_UINT32, FI_UINT64 |
| FI_MIN | FI_INT64, FI_DOUBLE |
| FI_MAX | FI_INT64, FI_DOUBLE |
| FI_SUM | FI_INT64, FI_DOUBLE |
| FI_CXI_MINMAXLOC | FI_INT64, FI_DOUBLE |
| FI_CXI_REPSUM | FI_DOUBLE |
Data space is limited to 32 bytes in all cases except REPSUM, which supports only a single FI_DOUBLE.
Only unsigned bitwise operators are supported.
Only signed integer arithmetic operations are supported.
The MINMAXLOC operators use a mixed data representation consisting of two values and two indices. Each rank reports its minimum value and rank index, and its maximum value and rank index. The collective result is the global minimum value and rank index, and the global maximum value and rank index. Data structures for these functions can be found in the fi_cxi_ext.h file. The datatype should represent the type of the minimum/maximum values, and the count must be 1.
The double-precision operators provide an associative (NUM) variant for MIN, MAX, and MINMAXLOC. Default IEEE behavior is to treat any operation with NaN as invalid, including comparison, which has the interesting property of causing:
MIN(NaN, value) => NaN
MAX(NaN, value) => NaN
This means that if NaN creeps into a MIN/MAX reduction in any rank, it tends to poison the entire result. The associative variants instead effectively ignore the NaN, such that:
MIN(NaN, value) => value
MAX(NaN, value) => value
The REPSUM operator implements a reproducible (associative) sum of double-precision values. The payload can accommodate only a single double-precision value per reduction, so count must be 1.
See: Berkeley reproducible sum algorithm https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-121.pdf
double precision rounding
C99 defines four rounding modes for double-precision SUM, and some systems may support a “flush-to-zero” mode for each of these, resulting in a total of eight different modes for double-precision sum.
The fabric hardware supports all eight modes transparently.
Although the rounding modes have thread scope, all threads, processes, and nodes should use the same rounding mode for any single reduction.
reduction flags
The reduction operations support two flags:
- FI_MORE
- FI_CXI_PRE_REDUCED (overloads FI_SOURCE)
The FI_MORE flag advises that the result data pointer represents an opaque, local reduction accumulator, and will be used as the destination of the reduction. This operation can be repeated any number of times to accumulate results locally, and spans the full set of all supported reduction operators. The op, count, and datatype values must be consistent for all calls. The operation ignores all global or static variables — it can be treated as a pure function call — and returns immediately. The caller is responsible for protecting the accumulator memory if it is used by multiple threads or processes on a compute node.
If FI_MORE is omitted, the destination is the fabric, and this will initiate a fabric reduction through the associated endpoint. The reduction must be progressed, and upon successful completion, the result data pointer will be filled with the final reduction result of count elements of type datatype.
The FI_CXI_PRE_REDUCED flag advises that the source data pointer represents an opaque reduction accumulator containing pre-reduced data. The count and datatype arguments are ignored.
If FI_CXI_PRE_REDUCED is omitted, the source is taken to be user data with count elements of type datatype.
The opaque reduction accumulator is exposed as struct cxip_coll_accumulator in the fi_cxi_ext.h file.
Note: The opaque reduction accumulator provides extra space for the expanded form of the reproducible sum, which carries the extra data required to make the operation reproducible in software.
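A sketch of the two-phase flow described above, accumulating several local contributions with FI_MORE and then launching a single fabric reduction from the pre-reduced accumulator. The count of four doubles (32 bytes) is illustrative.

#include <rdma/fabric.h>
#include <rdma/fi_collective.h>
#include <rdma/fi_cxi_ext.h>

static ssize_t reduce_two_phase(struct fid_ep *ep, fi_addr_t coll_addr,
                                double *chunks[], size_t nchunks,
                                double result[4], void *context)
{
    /* Opaque accumulator; assumed initialized by the first FI_MORE call. */
    struct cxip_coll_accumulator accum;
    size_t i;
    ssize_t ret;

    /* Phase 1: local accumulation into accum; returns immediately. */
    for (i = 0; i < nchunks; i++) {
        ret = fi_allreduce(ep, chunks[i], 4, NULL, &accum, NULL,
                           coll_addr, FI_DOUBLE, FI_SUM, FI_MORE, context);
        if (ret)
            return ret;
    }

    /* Phase 2: fabric reduction sourced from the pre-reduced accumulator. */
    return fi_allreduce(ep, &accum, 4, NULL, result, NULL, coll_addr,
                        FI_DOUBLE, FI_SUM, FI_CXI_PRE_REDUCED, context);
}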
OPTIMIZATION
Optimized MRs
The CXI provider has two separate MR implementations: standard and optimized. Standard MRs are designed to support applications which require a large number of remote memory regions. Optimized MRs are designed to support one-sided programming models that allocate a small number of large remote memory windows. The CXI provider can achieve higher RMA Write rates when targeting an optimized MR.
Both types of MRs are allocated using fi_mr_reg. MRs with client-provided keys in the range [0-99] are optimized MRs. MRs with keys greater than or equal to 100 are standard MRs. An application may create a mix of standard and optimized MRs. To disable the use of optimized MRs, set the environment variable FI_CXI_OPTIMIZED_MRS=false. When disabled, all MR keys are available and all MRs are implemented as standard MRs. All communicating processes must agree on the use of optimized MRs.
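A sketch of registering one MR of each type with client-provided keys; the access flags and window are illustrative, and with FI_MR_ENDPOINT the MRs must still be bound to an endpoint and enabled before use.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

static int reg_windows(struct fid_domain *domain, void *win, size_t len,
                       struct fid_mr **opt_mr, struct fid_mr **std_mr)
{
    int ret;

    /* Keys 0-99 select the optimized MR implementation. */
    ret = fi_mr_reg(domain, win, len, FI_REMOTE_WRITE | FI_REMOTE_READ,
                    0, 0 /* key */, 0, opt_mr, NULL);
    if (ret)
        return ret;

    /* Keys >= 100 select standard MRs. */
    return fi_mr_reg(domain, win, len, FI_REMOTE_WRITE | FI_REMOTE_READ,
                     0, 100 /* key */, 0, std_mr, NULL);
}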
When the FI_MR_PROV_KEY mr_mode is specified, caching of remote access MRs is enabled, which can improve registration/deregistration performance in RPC-type applications that wrap RMA operations within a message RPC protocol. Optimized MRs will be preferred, but the provider will fall back to standard MRs if insufficient hardware resources are available.
Optimized RMA
Optimized MRs are one requirement for the use of low overhead packet formats which enable higher RMA Write rates. An RMA Write will use the low overhead format when all the following requirements are met:
- The Write targets an optimized MR
- The target MR does not require remote completion notifications (no FI_RMA_EVENT)
- The Write does not have ordering requirements (no FI_ORDER_RMA_WAW)
Theoretically, Cassini has resources to support 64k standard MRs or 2k optimized MRs. Practically, the limits are much lower and depend greatly on application behavior.
Hardware counters can be used to validate the use of the low overhead packets. The counter C_CNTR_IXE_RX_PTL_RESTRICTED_PKT counts the number of low overhead packets received at the target NIC. Counter C_CNTR_IXE_RX_PTL_UNRESTRICTED_PKT counts the number of ordered RDMA packets received at the target NIC.
Message rate performance may be further optimized by avoiding target counting events. To avoid counting events, do not bind a counter to the MR. To validate optimal writes without target counting events, monitor the counter: C_CNTR_LPE_PLEC_HITS.
Unreliable AMOs
By default, all AMOs are resilient to intermittent packet loss in the network. Cassini implements a connection-based reliability model to support reliable execution of AMOs.
The connection-based reliability model may be disabled for AMOs in order to increase message rate. With reliability disabled, a lost AMO packet will result in operation failure. A failed AMO will be reported to the client in a completion event as usual. Unreliable AMOs may be useful for applications that can tolerate intermittent AMO failures or those where the benefit of increased message rate outweighs the cost of restarting after a failure.
Unreliable, non-fetching AMOs may be performed by specifying the FI_CXI_UNRELIABLE flag. Unreliable, fetching AMOs are not supported. Unreliable AMOs must target an optimized MR and cannot use remote completion notification. Unreliable AMOs are not ordered.
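A minimal sketch; the fi_msg_atomic contents are assumed to be filled in elsewhere and must target an optimized MR.

#include <rdma/fabric.h>
#include <rdma/fi_atomic.h>
#include <rdma/fi_cxi_ext.h>

static ssize_t unreliable_amo(struct fid_ep *ep,
                              const struct fi_msg_atomic *msg)
{
    /* Non-fetching AMO with reliability disabled; unordered. */
    return fi_atomicmsg(ep, msg, FI_CXI_UNRELIABLE);
}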
High Rate Put
High Rate Put (HRP) is a feature that increases message rate performance of RMA and unreliable non-fetching AMO operations at the expense of global ordering guarantees.
HRP responses are generated by the fabric egress port. Responses are coalesced by the fabric to achieve higher message rates. The completion event for an HRP operation guarantees delivery but does not guarantee global ordering. If global ordering is needed following an HRP operation, the source may follow the operation with a normal, fenced Put.
HRP RMA and unreliable AMO operations may be performed by specifying the FI_CXI_HRP flag. HRP AMOs must also use the FI_CXI_UNRELIABLE flag. Monitor the hardware counter C_CNTR_HNI_HRP_ACK at the initiator to validate that HRP is in use.
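A sketch of an HRP Put followed by a fenced, normal Put to re-establish global ordering; message setup is assumed to happen elsewhere.

#include <rdma/fabric.h>
#include <rdma/fi_rma.h>
#include <rdma/fi_cxi_ext.h>

static ssize_t hrp_write_then_fence(struct fid_ep *ep,
                                    const struct fi_msg_rma *hrp_msg,
                                    const struct fi_msg_rma *fence_msg)
{
    ssize_t ret;

    /* Delivery is guaranteed, global ordering is not. */
    ret = fi_writemsg(ep, hrp_msg, FI_CXI_HRP);
    if (ret)
        return ret;

    /* FI_FENCE orders this Put behind the preceding HRP traffic. */
    return fi_writemsg(ep, fence_msg, FI_FENCE);
}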
Counters
Cassini offloads light-weight counting events for certain types of operations. The rules for offloading are:
- Counting events for RMA and AMO source events are always offloaded.
- Counting events for RMA and AMO target events are always offloaded.
- Counting events for Sends are offloaded when message size is less than the rendezvous threshold.
- Counting events for message Receives are never offloaded by default.
Software progress is required to update counters unless the criteria for offloading are met.
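For example, offloaded counting of RMA and AMO target events can be arranged by binding a counter to an MR. Note that, per the Optimized RMA section above, an MR-bound counter disables the low-overhead packet path.

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

static int bind_mr_counter(struct fid_domain *domain, struct fid_mr *mr,
                           struct fid_cntr **cntr)
{
    struct fi_cntr_attr attr = { .events = FI_CNTR_EVENTS_COMP };
    int ret;

    ret = fi_cntr_open(domain, &attr, cntr, NULL);
    if (ret)
        return ret;

    /* Count remote writes/AMOs landing in this memory region. */
    return fi_mr_bind(mr, &(*cntr)->fid, FI_REMOTE_WRITE);
}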
RUNTIME PARAMETERS
The CXI provider checks for the following environment variables:
- FI_CXI_ODP
- Enables on-demand paging. If disabled, all DMA buffers are pinned. If enabled and mr_mode bits in the hints exclude FI_MR_ALLOCATED, then ODP mode will be used.
- FI_CXI_FORCE_ODP
- Experimental value that can be used to force the use of ODP mode even if FI_MR_ALLOCATED is set in the mr_mode hint bits. This is intended to be used primarily for testing.
- FI_CXI_ATS
- Enables PCIe ATS. If disabled, the NTA mechanism is used.
- FI_CXI_ATS_MLOCK_MODE
- Sets ATS mlock mode. The mlock() system call may be used in conjunction with ATS to help avoid network page faults. Valid values are “off” and “all”. When mlock mode is “off”, the provider does not use mlock(). An application using ATS without mlock() may experience network page faults, reducing network performance. When ats_mlock_mode is set to “all”, the provider uses mlockall() during initialization with ATS. mlockall() causes all mapped addresses to be locked in RAM at all times. This helps to avoid most network page faults. Using mlockall() may increase pressure on physical memory. Ignored when ODP is disabled.
- FI_CXI_RDZV_THRESHOLD
- Message size threshold for rendezvous protocol.
- FI_CXI_RDZV_GET_MIN
- Minimum rendezvous Get payload size. A Send with length less than or equal to FI_CXI_RDZV_THRESHOLD plus FI_CXI_RDZV_GET_MIN will be performed using the eager protocol. Larger Sends will be performed using the rendezvous protocol with FI_CXI_RDZV_THRESHOLD bytes of payload sent eagerly and the remainder of the payload read from the source using a Get. FI_CXI_RDZV_THRESHOLD plus FI_CXI_RDZV_GET_MIN must be less than or equal to FI_CXI_OFLOW_BUF_SIZE.
- FI_CXI_RDZV_EAGER_SIZE
- Eager data size for rendezvous protocol.
- FI_CXI_RDZV_PROTO
- Direct the provider to use a preferred protocol to transfer non-eager rendezvous data.
FI_CXI_RDZV_PROTO=default|alt_read
To use an alternate protocol, the CXI driver property rdzv_get_en should be set to “0”. The “alt_read” rendezvous protocol may help improve collective operation performance. Note that all rendezvous protocols use RDMA to transfer eager and non-eager rendezvous data.
- FI_CXI_DISABLE_NON_INJECT_MSG_IDC
- Experimental option to disable favoring IDC for transmit of small messages when FI_INJECT is not specified. This can be useful with GPU source buffers to avoid the host copy in cases where a performant copy cannot be used. The default is to use IDC for all messages less than the IDC size.
- FI_CXI_DISABLE_HOST_REGISTER
- Disable registration of host buffers (overflow and request) with GPU. There are scenarios where using a large number of processes per GPU results in page locking excessive amounts of memory degrading performance and/or restricting process counts. The default is to register buffers with the GPU.
- FI_CXI_OFLOW_BUF_SIZE
- Size of overflow buffers. Increasing the overflow buffer size allows more unexpected message eager data to be held in a single overflow buffer. The default size is 2MB.
- FI_CXI_OFLOW_BUF_MIN_POSTED/FI_CXI_OFLOW_BUF_COUNT
- The minimum number of overflow buffers that should be posted. The default minimum posted count is 3. Buffers will grow unbounded to support outstanding unexpected messages. Care should be taken to size appropriately based on job scale, size of eager data, and the amount of unexpected message traffic to reduce the need for flow control.
- FI_CXI_OFLOW_BUF_MAX_CACHED
- The maximum number of overflow buffers that will be cached. The default maximum count is 3 * FI_CXI_OFLOW_BUF_MIN_POSTED. A value of zero indicates that once an overflow buffer is allocated it will be cached and used as needed. A non-zero value can be used with bursty traffic to shrink the number of allocated buffers to the maximum count when they are no longer needed.
- FI_CXI_SAFE_DEVMEM_COPY_THRESHOLD
- Defines the maximum CPU memcpy size for HMEM device memory that is accessible by the CPU with load/store operations.
- FI_CXI_OPTIMIZED_MRS
- Enables optimized memory regions. See section CXI Domain Control Extensions on how to enable/disable optimized MRs at the domain level instead of for the global process/job.
- FI_CXI_MR_MATCH_EVENTS
- Enabling MR match events in a client/server environment can be used to ensure that memory backing a memory region cannot be remotely accessed after the MR has been closed, even if that memory remains mapped in the libfabric MR cache. Manual progress must be made at the target to process the MR match event accounting and avoid event queue overflow. There is a slight additional cost in the creation and tear-down of MRs. This option is disabled by default.
See section CXI Domain Control Extensions on how to enable MR match events at the domain level instead of for the global process/job.
- FI_CXI_PROV_KEY_CACHE
- Enabled by default, the caching of remote MR provider keys can be disabled by setting this variable to 0.
See section CXI Domain Control Extensions on how to disable the remote provider key cache at the domain level instead of for the global process/job.
- FI_CXI_LLRING_MODE
- Set the policy for use of the low-latency command queue ring mechanism. This mechanism improves the latency of command processing on an idle command queue. Valid values are idle, always, and never.
- FI_CXI_CQ_POLICY
- Experimental. Set Command Queue write-back policy. Valid values are always, high_empty, low_empty, and low. “always”, “high”, and “low” refer to the frequency of write-backs. “empty” refers to whether a write-back is performed when the queue becomes empty.
- FI_CXI_DEFAULT_VNI
- Default VNI value used only for service IDs where the VNI is not restricted.
- FI_CXI_RNR_MAX_TIMEOUT_US
- When using the endpoint FI_PROTO_CXI_RNR protocol, this setting controls the maximum time, measured from the original posting of the message, during which the message will be retried. A value of 0 will return an error completion on the first RNR ack status.
- FI_CXI_EQ_ACK_BATCH_SIZE
- Number of EQ events to process before writing an acknowledgement to HW. Batching ACKs amortizes the cost of event acknowledgement over multiple network operations.
- FI_CXI_RX_MATCH_MODE
- Specify the receive message matching mode to be utilized.
FI_CXI_RX_MATCH_MODE=hardware | software | hybrid
hardware - Message matching is fully offloaded, if resources become exhausted flow control will be performed and existing unexpected message headers will be onloaded to free resources.
software - Message matching is fully onloaded.
hybrid - Message matching begins fully offloaded; if resources become exhausted, hardware will transition message matching to a hybrid of hardware and software matching.
For both “hybrid” and “software” modes, care should be taken to minimize the threshold for rendezvous processing (i.e. FI_CXI_RDZV_THRESHOLD + FI_CXI_RDZV_GET_MIN). When running in software endpoint mode, the environment variables FI_CXI_REQ_BUF_SIZE and FI_CXI_REQ_BUF_MIN_POSTED are used to control the size and number of the eager request buffers posted to handle incoming unmatched messages.
- FI_CXI_HYBRID_PREEMPTIVE
- When in hybrid mode, this variable can be used to enable preemptive transitions to software matching. This is useful at scale for poorly written applications with a large number of unexpected messages, where reserved resources may be insufficient to prevent starvation of software request list match entries. Default is 0, disabled.
- FI_CXI_HYBRID_RECV_PREEMPTIVE
- When in hybrid mode, this variable can be used to enable preemptive transitions to software matching. This is useful at scale for poorly written applications with a large number of unmatched posted receives where reserved resources may be insufficient to prevent starvation of software request list match entries. Default is 0, disabled.
- FI_CXI_HYBRID_POSTED_RECV_PREEMPTIVE
- When in hybrid mode, this variable can be used to enable preemptive transitions to software matching when the number of posted receives exceeds the user-requested RX size attribute. This is useful for applications that may not know the exact number of posted receives and are experiencing application termination due to event queue overflow. Default is 0, disabled.
- FI_CXI_HYBRID_UNEXPECTED_MSG_PREEMPTIVE
- When in hybrid mode, this variable can be used to enable preemptive transitions to software matching when the number of hardware unexpected messages exceeds the user-requested RX size attribute. This is useful for applications that may not know the exact number of posted receives and are experiencing application termination due to event queue overflow. Default is 0, disabled.
- FI_CXI_REQ_BUF_SIZE
- Size of request buffers. Increasing the request buffer size allows for more unmatched messages to be sent into a single request buffer. The default size is 2MB.
- FI_CXI_REQ_BUF_MIN_POSTED
- The minimum number of request buffers that should be posted. The default minimum posted count is 4. The number of buffers will grow unbounded to support outstanding unexpected messages. Care should be taken to size appropriately based on job scale and the size of eager data to reduce the need for flow control.
- FI_CXI_REQ_BUF_MAX_CACHED/FI_CXI_REQ_BUF_MAX_COUNT
- The maximum number of request buffers that will be cached. The default maximum count is 0. A value of zero indicates that once a request buffer is allocated it will be cached and used as needed. A non-zero value can be used with bursty traffic to shrink the number of allocated buffers to a maximum count when they are no longer needed.
- FI_CXI_MSG_LOSSLESS
- Enable or disable lossless receive matching. If hardware resources are exhausted, hardware will pause the associated traffic class until an overflow buffer (hardware match mode) or request buffer (software match mode or hybrid match mode) is posted. This is considered experimental and defaults to disabled.
- FI_CXI_FC_RETRY_USEC_DELAY
- Number of micro-seconds to sleep before retrying a dropped side-band, flow control message. Setting to zero will disable any sleep.
- FI_UNIVERSE_SIZE
- Defines the maximum number of processes that will be used by a distributed OFI application. Note that this value is used in setting the default control EQ size; see FI_CXI_CTRL_RX_EQ_MAX_SIZE.
- FI_CXI_CTRL_RX_EQ_MAX_SIZE
- Max size of the receive event queue used for side-band/control messages. Default receive event queue size is based on FI_UNIVERSE_SIZE. Increasing the receive event queue size can help prevent side-band/control messages from being dropped and retried but at the cost of additional memory usage. Size is always aligned up to a 4KiB boundary.
- FI_CXI_DEFAULT_CQ_SIZE
- Change the provider default completion queue size expressed in entries. This may be useful for applications which rely on middleware, and middleware defaults the completion queue size to the provider default.
- FI_CXI_DISABLE_EQ_HUGETLB/FI_CXI_DISABLE_CQ_HUGETLB
- By default, the provider will attempt to allocate 2 MiB hugetlb pages for provider event queues. Disabling hugetlb support will cause the provider to fallback to memory allocators using host page sizes. FI_CXI_DISABLE_EQ_HUGETLB replaces FI_CXI_DISABLE_CQ_HUGETLB, however use of either is still supported.
- FI_CXI_DEFAULT_TX_SIZE
- Set the default tx_attr.size field to be used by the provider if the size is not specified in the user provided fi_info hints.
- FI_CXI_DEFAULT_RX_SIZE
- Set the default rx_attr.size field to be used by the provider if the size is not specified in the user provided fi_info hints.
- FI_CXI_SW_RX_TX_INIT_MAX
- Debug control to override the number of TX operations that can be outstanding that are initiated by software RX processing. It has no impact on hardware initiated RX rendezvous gets.
- FI_CXI_DEVICE_NAME
- Restrict CXI provider to specific CXI devices. Format is a comma separated list of CXI devices (e.g. cxi0,cxi1).
- FI_CXI_TELEMETRY
- Perform a telemetry delta between fi_domain open and close. Format is a comma separated list of telemetry files as defined in /sys/class/cxi/cxi*/device/telemetry/. The ALL-in-binary file in this directory is invalid. Note that these are per CXI interface counters and not per CXI process per interface counters.
- FI_CXI_TELEMETRY_RGID
- Resource group ID (RGID) to restrict telemetry collection to. A value less than 0 means no restriction.
- FI_CXI_CQ_FILL_PERCENT
- Fill percent of the underlying hardware event queue used to determine when a completion queue is saturated. A saturated completion queue results in the provider returning -FI_EAGAIN for data transfer and other related libfabric operations.
- FI_CXI_COMPAT
- Temporary compatibility to allow use of pre-upstream values for FI_ADDR_CXI and FI_PROTO_CXI. Compatibility can be disabled to verify operation with upstream constant values and to enable access to conflicting provider values. The default setting of 1 specifies that both old and new constants are supported. A setting of 0 disables support for old constants and can be used to test that an application is compatible with the upstream values. A setting of 2 is a safety fallback; if used, the provider will only export fi_info with old constants and will be incompatible with libfabric clients that have been recompiled.
- FI_CXI_COLL_FABRIC_MGR_URL
- accelerated collectives: Specify the HTTPS address of the fabric manager REST API used to create specialized multicast trees for accelerated collectives. This parameter is REQUIRED for accelerated collectives, and is a fixed, system-dependent value.
- FI_CXI_COLL_TIMEOUT_USEC
- accelerated collectives: Specify the reduction engine timeout. This should be larger than the maximum expected compute cycle in repeated reductions, or acceleration can create incast congestion in the switches. The relative performance benefit of acceleration declines with increasing compute cycle time, dropping below one percent at 32 msec (32000). Using acceleration with compute cycles larger than 32 msec is not recommended except for experimental purposes. Default is 32 msec (32000), maximum is 20 sec (20000000).
- FI_CXI_COLL_USE_DMA_PUT
- accelerated collectives: Use DMA for collective packet put. This uses DMA to inject reduction packets rather than IDC, and is considered experimental. Default is false.
- FI_CXI_DISABLE_HMEM_DEV_REGISTER
- Disable registering HMEM device buffer for load/store access. Some HMEM devices (e.g. AMD, Nvidia, and Intel GPUs) support backing the device memory by the PCIe BAR. This enables software to perform load/stores to the device memory via the BAR instead of using device DMA engines. Direct load/store access may improve performance.
- FI_CXI_FORCE_ZE_HMEM_SUPPORT
- Force the enablement of ZE HMEM support. By default, the CXI provider will only support ZE memory registration if implicit scaling is disabled (i.e. the environment variables EnableImplicitScaling=0 and NEOReadDebugKeys=1 are set). Setting FI_CXI_FORCE_ZE_HMEM_SUPPORT to 1 will cause the CXI provider to skip the implicit scaling checks. GPU direct RDMA may or may not work in this case.
- FI_CXI_ENABLE_TRIG_OP_LIMIT
- Enable enforcement of the triggered operation limit. This can prevent fi_control(FI_QUEUE_WORK) from deadlocking, at the cost of performance.
- FI_CXI_MR_CACHE_EVENTS_DISABLE_POLL_NSECS
- Max amount of time, in nanoseconds, to poll when disabling an MR configured with MR match events.
- FI_CXI_MR_CACHE_EVENTS_DISABLE_LE_POLL_NSECS
- Max amount of time, in nanoseconds, to poll when LE-invalidate disabling an MR configured with MR match events.
- FI_CXI_FORCE_DEV_REG_COPY
- Force the CXI provider to use the HMEM device register copy routines. If not supported, RDMA operations or memory registration will fail.
Note: Use the fi_info utility to query provider environment variables:
fi_info -p cxi -e
CXI EXTENSIONS
The CXI provider supports various fabric-specific extensions. Extensions are accessed using the fi_open_ops function.
CXI Domain Control Extensions
The fi_control() function is extended for domain FIDs to query and override global environment settings for a specific domain. This is useful, for example, when the application process also includes a client API that requires different optimizations and protections.
Command FI_OPT_CXI_GET_OPTIMIZED where the argument is a pointer to a bool. The call returns the setting for optimized MR usage for the domain. The default is determined by the environment setting of FI_CXI_OPTIMIZED_MRS.
Command FI_OPT_CXI_SET_OPTIMIZED where the argument is a pointer to a bool initialized to true or false. The call enables or disables the use of optimized MRs for the domain. If the domain is not configured for the FI_MR_PROV_KEY MR mode, the call will fail with -FI_EINVAL; optimized MRs are not supported with client-generated keys. It must be called prior to any MR being created.
Command FI_OPT_CXI_GET_MR_MATCH_EVENTS where the argument is a pointer to a bool. The call returns the setting for MR Match Event accounting for the domain. The default is determined by the environment setting of FI_CXI_MR_MATCH_EVENTS.
Command FI_OPT_CXI_SET_MR_MATCH_EVENTS where the argument is a pointer to a bool initialized to true or false. This call enables or disables the use of MR Match Event counting. This ensures that memory backing an MR cannot be accessed after invoking fi_close() on the MR, even if that memory remains in the libfabric MR cache. Manual progress must be made to process events at the RMA destination. It can only be changed prior to any EP or MR being created.
Command FI_OPT_CXI_GET_PROV_KEY_CACHE where the argument is a pointer to a bool. The call returns the setting for enabling use of the remote MR cache for provider keys for the domain. The default is determined by the environment setting of FI_CXI_PROV_KEY_CACHE and is only valid if FI_MR_PROV_KEY MR mode is used.
Command FI_OPT_CXI_SET_PROV_KEY_CACHE where the argument is a pointer to a bool initialized to true or false. This call enables or disables the use of the remote MR cache for provider keys for the domain. By default the cache is enabled and can be used for provider keys that do not require events. The command will fail with -FI_EINVAL if FI_MR_PROV_KEY MR mode is not in use. It can only be changed prior to any MR being created.
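The following is a minimal sketch (error handling omitted) showing how a domain might query and then override the optimized MR setting via fi_control(); the values used are illustrative:
bool optimized;
/* Query the current optimized MR setting for the domain. */
ret = fi_control(&domain->fid, FI_OPT_CXI_GET_OPTIMIZED, &optimized);
/* Disable optimized MRs; must be done before any MR is created. */
optimized = false;
ret = fi_control(&domain->fid, FI_OPT_CXI_SET_OPTIMIZED, &optimized);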
CXI Domain Extensions
CXI domain extensions have been named FI_CXI_DOM_OPS_6. The flags parameter is ignored. The fi_open_ops function takes a struct fi_cxi_dom_ops. See an example of usage below:
struct fi_cxi_dom_ops *dom_ops;
ret = fi_open_ops(&domain->fid, FI_CXI_DOM_OPS_6, 0, (void **)&dom_ops, NULL);
The following domain extensions are defined:
struct fi_cxi_dom_ops {
int (*cntr_read)(struct fid *fid, unsigned int cntr, uint64_t *value,
struct timespec *ts);
int (*topology)(struct fid *fid, unsigned int *group_id,
unsigned int *switch_id, unsigned int *port_id);
int (*enable_hybrid_mr_desc)(struct fid *fid, bool enable);
size_t (*ep_get_unexp_msgs)(struct fid_ep *fid_ep,
struct fi_cq_tagged_entry *entry,
size_t count, fi_addr_t *src_addr,
size_t *ux_count);
int (*get_dwq_depth)(struct fid *fid, size_t *depth);
};
The cntr_read extension is used to read Cassini telemetry items, which consist of counters and gauges. The items available and their content depend on the Cassini ASIC version and Cassini driver version.
The topology extension is used to return CXI NIC address topology information for the domain. Currently only a dragonfly fabric topology is reported.
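For illustration, a minimal sketch using these two extensions follows; the telemetry item index 0 is a placeholder, since valid indices depend on the Cassini ASIC and driver versions:
uint64_t value;
struct timespec ts;
unsigned int group_id, switch_id, port_id;
/* Read one telemetry item into value along with its timestamp. */
ret = dom_ops->cntr_read(&domain->fid, 0, &value, &ts);
/* Query the NIC's dragonfly position: group, switch, and port. */
ret = dom_ops->topology(&domain->fid, &group_id, &switch_id, &port_id);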
The ep_get_unexp_msgs function is used to obtain a list of unexpected messages associated with an endpoint. The list is returned as an array of CQ tagged entries set in the following manner:
struct fi_cq_tagged_entry {
.op_context = NULL,
.flags = any of [FI_TAGGED | FI_MSG | FI_REMOTE_CQ_DATA],
.len = message length,
.buf = NULL,
.data = CQ data if FI_REMOTE_CQ_DATA set,
.tag = tag if FI_TAGGED set
};
If the src_addr or entry array is NULL, only the ux_count of available unexpected list entries will be returned. The count parameter specifies the size of the provided array; if it is 0, only the ux_count will be returned. The function returns the number of entries written to the array or a negative errno. On successful return, ux_count will always be set to the total number of unexpected messages available.
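A minimal sketch of the two-step usage follows (error handling omitted; ep is a previously created struct fid_ep pointer):
size_t ux_count;
ssize_t n;
/* Pass NULL arrays to learn how many unexpected messages exist. */
dom_ops->ep_get_unexp_msgs(ep, NULL, 0, NULL, &ux_count);
struct fi_cq_tagged_entry *entries = calloc(ux_count, sizeof(*entries));
fi_addr_t *srcs = calloc(ux_count, sizeof(*srcs));
/* Retrieve up to ux_count entries; n is the number written. */
n = dom_ops->ep_get_unexp_msgs(ep, entries, ux_count, srcs, &ux_count);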
enable_hybrid_mr_desc is used to enable hybrid MR descriptor mode. Hybrid MR desc allows for libfabric users to optionally pass in a valid MR desc for local communication operations. This is currently only used for RMA and AMO transfers.
get_dwq_depth is used to get the depth of the deferred work queue. The depth is the number of triggered operation commands which can be queued to hardware. The depth is not per fi_domain but rather per service ID. Since a single service ID is intended to be shared by all processes using the same NIC in a job step, the triggered operations are shared across processes.
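For example:
size_t depth;
/* Depth of the deferred work queue shared across the service ID. */
ret = dom_ops->get_dwq_depth(&domain->fid, &depth);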
enable_mr_match_events and enable_optimized_mrs have been deprecated in favor of the fi_control() API. While they can still be called via the domain ops, they will be removed from the domain ops prior to software release 2.2.
CXI Counter Extensions
CXI counter extensions have been named FI_CXI_COUNTER_OPS. The flags parameter is ignored. The fi_open_ops function takes a struct fi_cxi_cntr_ops. See an example of usage below:
struct fi_cxi_cntr_ops *cntr_ops;
ret = fi_open_ops(&cntr->fid, FI_CXI_COUNTER_OPS, 0, (void **)&cntr_ops, NULL);
The following counter extensions are defined:
struct fi_cxi_cntr_ops {
/* Set the counter writeback address to a client provided address. */
int (*set_wb_buffer)(struct fid *fid, const void *buf, size_t len);
/* Get the counter MMIO region. */
int (*get_mmio_addr)(struct fid *fid, void **addr, size_t *len);
};
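A minimal sketch using both extensions, continuing from the example above; the 64-byte buffer size is illustrative only, as the provider defines the required writeback buffer layout:
char wb_buf[64];   /* illustrative size; provider defines required layout */
void *mmio_addr;
size_t mmio_len;
/* Redirect counter writebacks to the client-provided buffer. */
ret = cntr_ops->set_wb_buffer(&cntr->fid, wb_buf, sizeof(wb_buf));
/* Retrieve the counter MMIO region for direct access. */
ret = cntr_ops->get_mmio_addr(&cntr->fid, &mmio_addr, &mmio_len);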
CXI Counter Writeback Flag
If a client is using the CXI counter extensions to define a counter writeback buffer, the CXI provider will not update the writeback buffer with success or failure values for each hardware counter success or failure update. This can create issues when clients expect the completion of a deferred work queue operation to generate a counter writeback. To support this, the FI_CXI_CNTR_WB flag can be used in conjunction with a deferred work queue operation to force a writeback at the completion of that operation. See an example of usage below.
struct fi_op_rma rma = {
/* Signal to the provider the completion of the RMA should trigger a
* writeback.
*/
.flags = FI_CXI_CNTR_WB,
};
struct fi_deferred_work rma_work = {
.op_type = FI_OP_READ,
.triggering_cntr = cntr,
.completion_cntr = cntr,
.threshold = 1,
.op.rma = &rma,
};
ret = fi_control(&domain->fid, FI_QUEUE_WORK, &rma_work);
Note: Using FI_CXI_CNTR_WB will lead to additional hardware usage. To conserve hardware resources, it is recommended to use FI_CXI_CNTR_WB only when a counter writeback is absolutely required.
CXI Alias EP Overrides
A transmit alias endpoint can be created and configured to utilize a different traffic class than the original endpoint. This provides a lightweight mechanism to utilize multiple traffic classes within a process. Message order between the original endpoint and the alias endpoint is not defined/guaranteed. See example usage below for setting the traffic class of a transmit alias endpoint.
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_cxi_ext.h> // Ultimately fi_ext.h
struct fid_ep *ep;
. . .
struct fid_ep *alias_ep = NULL;
uint32_t tclass = FI_TC_LOW_LATENCY;
uint64_t op_flags = FI_TRANSMIT; /* OR in desired data operation flags */
ret = fi_ep_alias(ep, &alias_ep, op_flags);
if (ret)
error;
ret = fi_set_val(&alias_ep->fid, FI_OPT_CXI_SET_TCLASS, (void *)&tclass);
if (ret)
error;
In addition, the alias endpoint message order may be modified to override the default endpoint message order. Message order between the modified alias endpoint and the original endpoint is not guaranteed. See example usage below for setting the message order of a transmit alias endpoint.
uint64_t msg_order = FI_ORDER_RMA_WAW;
ret = fi_set_val(&alias_ep->fid, FI_OPT_CXI_SET_MSG_ORDER,
(void *)&msg_order);
if (ret)
error;
When an endpoint does not support FI_FENCE (e.g. optimized MR), a provider specific transmit flag, FI_CXI_WEAK_FENCE, may be specified on an alias EP to issue a FENCE operation to create a data ordering point for the alias. This is supported for one-sided operations only.
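For example, a sketch issuing a one-sided write on the alias EP with a weak fence (assuming the RMA message has been initialized elsewhere):
struct fi_msg_rma msg;
/* ... initialize msg ... */
/* Create a data ordering point on the alias EP. */
ret = fi_writemsg(alias_ep, &msg, FI_CXI_WEAK_FENCE);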
An alias EP must be closed prior to closing the original EP.
PCIe Atomics
The CXI provider can issue a given libfabric atomic memory operation as a PCIe atomic operation rather than a NIC atomic operation. The CXI provider extension flag FI_CXI_PCIE_AMO is used to signify this.
Since not all libfabric atomic memory operations can be executed as a PCIe atomic memory operation, fi_query_atomic() can be used to query whether a given libfabric atomic memory operation can be executed as a PCIe atomic memory operation.
The following is a query to see if a given libfabric operation can be a PCIe atomic operation.
int ret;
struct fi_atomic_attr out_attrs;
/* Query if non-fetching PCIe atomic is supported. */
ret = fi_query_atomic(domain, FI_UINT32, FI_SUM, &out_attrs, FI_CXI_PCIE_AMO);
/* Query if fetching PCIe atomic is supported. */
ret = fi_query_atomic(domain, FI_UINT32, FI_SUM, &out_attrs,
FI_FETCH_ATOMIC | FI_CXI_PCIE_AMO);
The following is how to issue a PCIe atomic operation.
ssize_t ret;
struct fi_msg_atomic msg;
struct fi_ioc resultv;
void *result_desc;
size_t result_count;
ret = fi_fetch_atomicmsg(ep, &msg, &resultv, &result_desc, result_count,
FI_CXI_PCIE_AMO);
Note: The CXI provider only supports PCIe fetch add for FI_UINT32, FI_INT32, FI_UINT64, and FI_INT64. This support requires enablement of PCIe fetch add in the CXI driver, and it comes at the cost of losing NIC atomic support for another libfabric atomic operation.
Note: Ordering between PCIe atomic operations and NIC atomic/RMA operations is undefined.
To enable PCIe fetch add for libfabric, the following CXI driver kernel module parameter must be set to non-zero.
/sys/module/cxi_ss1/parameters/amo_remap_to_pcie_fadd
The following are the possible values for this kernel module and the impact of each value:
- -1: Disable PCIe fetch add support. FI_CXI_PCIE_AMO is not supported.
- 0: Enable PCIe fetch add support. FI_MIN is not supported.
- 1: Enable PCIe fetch add support. FI_MAX is not supported.
- 2: Enable PCIe fetch add support. FI_SUM is not supported.
- 4: Enable PCIe fetch add support. FI_LOR is not supported.
- 5: Enable PCIe fetch add support. FI_LAND is not supported.
- 6: Enable PCIe fetch add support. FI_BOR is not supported.
- 7: Enable PCIe fetch add support. FI_BAND is not supported.
- 8: Enable PCIe fetch add support. FI_LXOR is not supported.
- 9: Enable PCIe fetch add support. FI_BXOR is not supported.
- 10: Enable PCIe fetch add support. No loss of default CXI provider AMO functionality.
Guidance is to default amo_remap_to_pcie_fadd to 10.
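For example, on systems where the parameter may be changed at runtime (root privileges required; many systems instead set it via modprobe configuration), the recommended value can be applied as follows:
echo 10 > /sys/module/cxi_ss1/parameters/amo_remap_to_pcie_fadd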
FABTESTS
The CXI provider does not currently support fabtests which depend on IP addressing.
fabtests RDM benchmarks are supported, for example:
# Start server by specifying source PID and interface
./fabtests/benchmarks/fi_rdm_tagged_pingpong -B 10 -s cxi0
# Read server NIC address
CXI0_ADDR=$(cat /sys/class/cxi/cxi0/device/properties/nic_addr)
# Start client by specifying server PID and NIC address
./fabtests/benchmarks/fi_rdm_tagged_pingpong -P 10 $CXI0_ADDR
# The client may be bound to a specific interface, like:
./fabtests/benchmarks/fi_rdm_tagged_pingpong -B 10 -s cxi1 -P 10 $CXI0_ADDR
Some functional fabtests are supported (including fi_bw). Others use IP sockets and are not yet supported.
multinode fabtests are not yet supported.
ubertest is supported for test configs matching the provider’s current capabilities.
unit tests are supported where the test feature set matches the CXI provider’s current capabilities.
ERRATA
- Fetch and compare type AMOs with FI_DELIVERY_COMPLETE or FI_MATCH_COMPLETE completion semantics are not supported with FI_RMA_EVENT.
Libfabric CXI Provider User Programming and Troubleshooting Guide
The following subsections provide guidance and troubleshooting tips for users of the libfabric CXI provider. They are not intended to be a complete guide to using libfabric.
Sizing Libfabric Objects Based on Expected Usage
The CXI provider uses various libfabric object size attributes and libfabric environment variables to size hardware-related resources accordingly. Failure to size resources properly can result in the CXI provider frequently returning -FI_EAGAIN, which may negatively impact performance. The following subsections outline important sizing-related attributes and environment variables.
Completion Queue Size Attribute
The CXI provider uses the completion queue size attribute to size various software and hardware event queues used to generate libfabric completion events. While the software queues may grow, hardware event queue sizes are static. Failing to size hardware queues properly may result in the CXI provider returning -FI_EAGAIN frequently for data transfer operations. When this error is returned, users should progress the corresponding endpoint completion queues by calling fi_cq_read().
Users are encouraged to set the completion queue size attribute based on the expected number of inflight RDMA operations to and from a single endpoint. For users which are relying on the provider default value (e.g. MPI), the FI_CXI_DEFAULT_CQ_SIZE environment variable can be used to override the provider default value.
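For example, a sketch sizing a CQ for an expected number of inflight operations (the value shown is illustrative):
struct fid_cq *cq;
struct fi_cq_attr cq_attr = {
    .size = 16384,                 /* expected inflight RDMA operations */
    .format = FI_CQ_FORMAT_TAGGED,
};
ret = fi_cq_open(domain, &cq_attr, &cq, NULL);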
Endpoint Receive Size Attribute
The CXI provider uses the endpoint receive size attribute to size internal command and hardware event queues. Failing to size either queue correctly can result in the CXI provider returning -FI_EAGAIN frequently for data transfer operations. When this error is returned, users should progress the corresponding endpoint completion queues by calling fi_cq_read().
Users are encouraged to set the endpoint receive size attribute based on the expected number of inflight untagged and tagged RDMA operations. For users relying on the provider default value (e.g. MPI), the FI_CXI_DEFAULT_RX_SIZE environment variable can be used to override the provider default value.
Endpoint Transmit Size Attribute
The CXI provider uses the endpoint transmit size attribute to size internal command and hardware event queues. Failing to size either queue correctly can result in the CXI provider returning -FI_EAGAIN frequently for data transfer operations. When this error is returned, users should progress the corresponding endpoint completion queues by calling fi_cq_read().
At a minimum, users are encouraged to set the endpoint transmit size attribute based on the expected number of inflight, initiator RDMA operations. If users will be issuing message operations over the CXI provider rendezvous limit (FI_CXI_RDZV_THRESHOLD), the transmit size attribute must also include the number of outstanding, unexpected rendezvous operations (i.e. inflight, initiator RDMA operations + outstanding, unexpected rendezvous operations).
For users which are relying on the provider default value (e.g. MPI), the FI_CXI_DEFAULT_TX_SIZE environment variable can be used to override the provider default value.
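For example, a sketch setting both size attributes through fi_getinfo() hints (values and API version illustrative):
struct fi_info *hints = fi_allocinfo();
struct fi_info *info;
hints->tx_attr->size = 4096;   /* expected inflight initiator operations */
hints->rx_attr->size = 4096;   /* expected inflight receive operations */
ret = fi_getinfo(FI_VERSION(1, 15), NULL, NULL, 0, hints, &info);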
FI_UNIVERSE_SIZE Environment Variable
The libfabric FI_UNIVERSE_SIZE environment variable defines the number of ranks/peers an application expects to communicate with. The CXI provider may use this environment variable to size resources tied to the number of peers. Users are encouraged to set this environment variable accordingly.
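For example, for a job communicating with up to 2048 peers:
export FI_UNIVERSE_SIZE=2048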
Selecting Proper Receive Match Mode
As mentioned in the Runtime Parameters section, the CXI provider supports 3 different operational modes: hardware, hybrid, and software.
Hardware match mode is appropriate for users who can ensure the sum of unexpected messages and posted receives does not exceed the configured hardware receive resource limit for the application. When resources are consumed, the endpoint will transition into a flow control operational mode which requires side-band messaging to recover from. Recovery involves the CXI provider trying to reclaim hardware receive resources to help prevent future transitions into flow control. If the CXI provider is unable to reclaim hardware receive resources, this can lead to a cycle of entering and exiting flow control which may present itself as a hang to the libfabric user. Running with FI_LOG_LEVEL=warn and FI_LOG_PROV=cxi will report whether this flow control transition is happening.
Hybrid match mode is appropriate for users who are unsure whether the sum of unexpected messages and posted receives will exceed the configured hardware receive resource limit for the application, but want to ensure the application still functions if hardware receive resources are consumed. Hybrid match mode extends hardware match mode by allowing an automated transition into software match mode if resources are consumed.
Software match mode is appropriate for users who know the sum of unexpected messages and posted receives will exceed the configured hardware receive resource limit for the application. In software match mode, the CXI provider maintains software unexpected and posted receive lists rather than offloading them to hardware. This avoids having to allocate a hardware receive resource for each unexpected message and posted receive.
Note: In practice, dependent processes (e.g. a parallel job) will most likely be sharing a receive hardware resource pool.
Note: Each match mode may still enter flow control. For example, if a user is not draining the libfabric completion queue at a reasonable rate, corresponding hardware events may fill up which will trigger flow control.
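Match mode is selected at startup via the FI_CXI_RX_MATCH_MODE environment variable (see the Runtime Parameters section); for example:
export FI_CXI_RX_MATCH_MODE=hybrid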
Using Hybrid Match Mode Preemptive Options
The high-level objective of the hybrid match mode preemptive environment variables (i.e. FI_CXI_HYBRID_PREEMPTIVE, FI_CXI_HYBRID_RECV_PREEMPTIVE, FI_CXI_HYBRID_POSTED_RECV_PREEMPTIVE, and FI_CXI_HYBRID_UNEXPECTED_MSG_PREEMPTIVE) is to ensure a process requiring more hardware receive resources does not force other processes requiring fewer hardware receive resources into software match mode because no hardware receive resources remain.
For example, consider a parallel application which has multiple processes (i.e. ranks) per NIC, all sharing the same hardware receive resource pool. Suppose that the application communication pattern results in an all-to-one communication to only a single rank (e.g. rank 0) while other ranks communicate amongst each other. If the width of the all-to-one exceeds available hardware resources, all ranks on the target NIC will transition to software match mode. The preemptive options may help ensure that only rank 0 transitions to software match mode instead of all the ranks on the target NIC.
The FI_CXI_HYBRID_POSTED_RECV_PREEMPTIVE and FI_CXI_HYBRID_UNEXPECTED_MSG_PREEMPTIVE environment variables will force the transition to software match mode if the user-requested endpoint receive size attribute is exceeded. The benefit of running with these enabled is that the software match mode transition is entirely in the control of the libfabric user through the receive size attribute. One approach is to set the receive size attribute to expected usage; if this expected usage is exceeded, only the offending endpoints will transition to software match mode.
The FI_CXI_HYBRID_PREEMPTIVE and FI_CXI_HYBRID_RECV_PREEMPTIVE environment variables will force the transition to software match mode if hardware receive resources in the pool are running low. The CXI provider performs a multi-step process to transition the libfabric endpoint to software match mode. The benefit of running with these enabled is that the number of endpoints transitioning to software match mode may be smaller than with a forced software match mode transition due to zero hardware resources being available.
Preventing Messaging Flow Control Due to Hardware Event Queue Sizing
As much as possible, CXI provider message flow control should be avoided. Flow control results in expensive, side-band, CXI provider internal messaging to recover from. One cause of flow control is improper hardware event queue sizing. If the hardware event queue is undersized, resulting in it filling more quickly than expected, the next incoming message operation targeting the full event queue will be dropped and flow control triggered.
The default CXI provider behavior is to size hardware event queues based on endpoint transmit and receive size attributes. Thus, it is critical for users to set these attributes accordingly.
The CQ size can be used to override the CXI provider calculated hardware event queue size based on endpoint transmit and receive size attributes. If the CQ size is greater than the CXI provider calculation, the value from the CQ size will be used.
The CQ fill percent can be used to define a threshold beyond which no new RDMA operations can be queued until the libfabric CQ is progressed, thus draining hardware event queues.
Interpreting CXI Provider CQ Error Event Errno
The following are the libfabric errno values which may be returned in an RDMA CQ error event.
FI_ETRUNC: Receive message truncation.
FI_EHOSTUNREACH: Target is unreachable. This is due to connectivity issues, such as downed links, between the two peers.
FI_ENOTCONN: Cannot communicate because no libfabric endpoint is configured at the target address. In this case, the target NIC is reachable.
FI_EIO: Catch all errno.
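A minimal sketch (error handling abbreviated) that retrieves an error event and inspects its errno:
struct fi_cq_tagged_entry entry;
struct fi_cq_err_entry err_entry;
ssize_t ret;
ret = fi_cq_read(cq, &entry, 1);
if (ret == -FI_EAVAIL) {
    ret = fi_cq_readerr(cq, &err_entry, 0);
    if (ret == 1) {
        switch (err_entry.err) {
        case FI_ETRUNC:        /* receive message truncation */
            break;
        case FI_EHOSTUNREACH:  /* peer unreachable */
            break;
        case FI_ENOTCONN:      /* no endpoint configured at target */
            break;
        default:               /* FI_EIO and other errors */
            break;
        }
    }
}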