WO2024201804A1 - Relay device - Google Patents

Relay device

Info

Publication number
WO2024201804A1
WO2024201804A1 PCT/JP2023/012870 JP2023012870W WO2024201804A1 WO 2024201804 A1 WO2024201804 A1 WO 2024201804A1 JP 2023012870 W JP2023012870 W JP 2023012870W WO 2024201804 A1 WO2024201804 A1 WO 2024201804A1
Authority
WO
WIPO (PCT)
Prior art keywords
packet
relay device
transfer destination
remote
computer
Prior art date
Application number
PCT/JP2023/012870
Other languages
French (fr)
Japanese (ja)
Inventor
顕至 田仲
勇輝 有川
勇介 村中
健 坂本
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社
Priority to PCT/JP2023/012870
Publication of WO2024201804A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00 Data switching networks
    • H04L12/28 Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04L12/46 Interconnection of networks

Definitions

  • the present invention relates to a relay device, such as a network switch, that relays data transferred via RDMA (Remote Direct Memory Access).
  • RDMA is a communications protocol.
  • a local computer issues a request to establish a connection with a remote computer.
  • The local computer then transfers data stored in its own memory directly to the remote computer's main memory (Non-Patent Document 1).
  • the local computer and remote computer are determined on a one-to-one basis when a connection is established. For this reason, after data is output from the local computer, it is not possible to transfer the data to one of multiple candidate transfer destinations. If data output from the local computer could be transferred to one of multiple candidate transfer destinations, there would be an advantage in that, for example, data could be transferred while avoiding busy remote computers among multiple remote computers.
  • the objective of the present invention is to enable RDMA transfer of data output from a local computer to one of multiple potential transfer destinations.
  • the relay device of the present invention is a relay device that relays data transferred by RDMA (Remote Direct Memory Access), and includes an information acquisition unit that acquires multiple pieces of connection information issued by multiple remote computers when each of the multiple remote computers receives the data by RDMA transfer, a transfer destination selection unit that selects a remote computer to which the packet is to be transferred from the multiple remote computers as a transfer destination computer when a packet containing the entire data or a divided part of the data is received from a local computer, and a packet output unit that includes the connection information issued by the destination computer from the multiple connection information acquired by the information acquisition unit in the packet and outputs the packet with the connection information included, thereby transferring the packet to the destination computer by RDMA transfer.
  • data output from a local computer can be RDMA transferred to one of multiple potential transfer destinations.
  • FIG. 1 is a configuration diagram of a system including a relay device according to a first embodiment of the present invention.
  • FIG. 2 is a hardware configuration diagram of the local and remote computers in FIG.
  • FIG. 3 is a block diagram of the relay device of FIG.
  • FIG. 4 is a flowchart for explaining the process executed by the relay device of FIG.
  • FIG. 5 is a diagram illustrating an example of the configuration of the management table.
  • FIG. 6 is a block diagram of a relay device according to the second embodiment.
  • FIG. 7 is a diagram showing an example of the configuration of a PSN table.
  • FIG. 8 is a block diagram of a relay device according to the third embodiment.
  • FIG. 9 is a block diagram of a relay device according to the third embodiment.
  • FIG. 10 is a configuration diagram of a system including a relay device according to the fourth embodiment.
  • the relay device 10 is communicatively connected to first and second local computers (also simply referred to as local) 91 and 92 as transfer sources in RDMA (Remote Direct Memory Access) via a network NW such as the Internet and a router (not shown).
  • the relay device 10 is further communicatively connected by wire or wireless to first and second remote computers (also simply referred to as remote) 93 and 94 as transfer destinations in RDMA.
  • the relay device 10 is configured as a network switch.
  • the remotes 93 and 94 are configured to be able to execute the same processing, realizing processing redundancy.
  • the relay device 10 transmits data to be processed from the local 91 or 92 to either the remote 93 or 94.
  • the data to be processed is transmitted in a packet including the data. If the data to be processed is long, it is divided into multiple pieces of data and transmitted in packets including each of the divided pieces of data.
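  • The division of long data into per-packet pieces can be sketched as follows. This is only an illustration: the function name `split_into_payloads` and the fixed per-packet size `max_payload` are assumptions, since the text does not state how the piece size is chosen.

```python
def split_into_payloads(data: bytes, max_payload: int) -> list[bytes]:
    """Divide data into consecutive pieces of at most max_payload bytes."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    return [data[i:i + max_payload] for i in range(0, len(data), max_payload)]


# Short data fits in a single packet; long data is divided across several.
short = split_into_payloads(b"abc", 4)
long_ = split_into_payloads(b"abcdefghij", 4)
```

    Each returned piece would become the payload of one packet, and concatenating the pieces in order restores the original data.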
  • the remotes 93 and 94 may be devices that perform distributed processing of big data transmitted from the locals 91 and 92. In this case, the data transmitted by the locals 91 and 92 may be transmitted to either the remote 93 or 94.
  • the locals 91 and 92 and the remotes 93 and 94 in FIG. 1 are also collectively referred to as computers 90.
  • Each of the locals 91 and 92 and the remotes 93 and 94, that is, each computer 90, has the configuration shown in FIG. 2.
  • the computer 90 comprises a CPU (Central Processing Unit) 90A, a main memory 90B, and a non-volatile storage device 90C that stores an OS (Operating System) executed by the CPU 90A, user applications, and other programs.
  • the computer 90 further comprises a communication interface 90D that is connected to the outside.
  • the communication interface 90D includes, for example, an R-NIC (RDMA-Network Interface Card) 90E as an HCA (Host Channel Adapter) for realizing RDMA transfers.
  • the communication interface 90D may include multiple R-NICs 90E.
  • the R-NIC90E has multiple QPs (Queue Pairs), each consisting of an SQ (Send Queue) and an RQ (Receive Queue), that are used when performing RDMA transfers.
  • the communication unit for RDMA transfers is a communication request called a WR (Work Request), and the data to be communicated is stacked in the SQ/RQ as a WQE (Work Queue Element).
  • the R-NIC90E also has a Completion Queue (CQ) corresponding to each SQ/RQ, where the WQEs stacked in the SQ/RQ are stacked as Completion Queue Entries (CQEs) when a WR between QPs is completed.
  • The relay device 10 includes an information acquisition unit 11, a transfer destination selection unit 12, a packet output unit 13, and a storage unit 14.
  • the above units 11 to 14 are configured as a whole in at least one of an ASIC (Application Specific Integrated Circuit) and an FPGA (Field-Programmable Gate Array).
  • the entire units 11 to 14 may be configured in either an ASIC or an FPGA, or a portion of the units 11 to 14 may be configured in either an ASIC or an FPGA, and the remainder of the units 11 to 14 may be configured in the other of an ASIC or an FPGA.
  • each of the above-mentioned units 11 to 14 of the relay device 10 will be explained with reference to FIG. 4 along with the flow of RDMA transfer.
  • the RDMA transfer will be explained as a transfer in which data prepared in the main memory 90B of the local 91 is written to the main memory 90B of either the remote 93 or 94.
  • the CPU 90A of the local 91 issues a request for establishing a connection for RDMA transfer and establishes a connection with the relay device 10 (step S11).
  • The information acquisition unit 11 of the relay device 10 establishes connections with each of the remotes 93 and 94 (steps S12 and S13).
  • each CPU 90A provided in the local 91 and the remotes 93 and 94 determines the memory area for RDMA transfer in the main memory 90B (the local side is the memory area that stores the transmitted data, and the remote side is the memory area that stores the received data).
  • Each CPU 90A also determines which of the multiple QPs of the R-NIC 90E to use.
  • Each CPU 90A generates connection information including the results of these determinations and transmits it to the relay device 10.
  • the transmitted connection information is acquired by the information acquisition unit 11 of the relay device 10.
  • The connection information includes, for example, the following information:
      (1) QP number: a number that identifies the QP used in this RDMA transfer.
      (2) Information about the memory area used in this RDMA transfer:
          key: a key for accessing the memory area (address) in the main memory where data to be transferred via RDMA is stored or read;
          addr: the address of the registered memory area;
          buffer size: the size of the registered memory area.
      (3) The global ID and local ID of the R-NIC used for RDMA, and the global ID and local ID of the computer having this R-NIC.
  • the information acquisition unit 11 of the relay device 10 registers the acquired connection information of the local 91, the connection information of the remote 93, and the connection information of the remote 94 in the management table (see FIG. 5) of the storage unit 14 in association with each other (step S14).
  • data to be RDMA transferred is transferred to one of multiple transfer destination candidates (here, remotes 93 and 94).
  • the management table is provided to manage the data transfer destination candidates. As shown in FIG. 5, the management table registers the connection information of the transfer source (such as the local 91) and each connection information of the transfer destination candidates (remote 93 and 94) in association with each other for each unit of RDMA transfer.
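  • One way to picture the management table is as a mapping from the connection information of a transfer source to the connection information of its destination candidates. The sketch below is illustrative only: the class names, field types, and sample values are assumptions, and the actual layout is the one shown in FIG. 5.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ConnectionInfo:
    qp_number: int     # (1) QP used in this RDMA transfer
    key: int           # (2) key for accessing the registered memory area
    addr: int          # (2) address of the registered memory area
    buffer_size: int   # (2) size of the registered memory area
    global_id: str     # (3) global ID (string form is an assumption)
    local_id: int      # (3) local ID


@dataclass
class ManagementTable:
    # One record per unit of RDMA transfer: source -> destination candidates.
    records: dict[ConnectionInfo, list[ConnectionInfo]] = field(default_factory=dict)

    def register(self, source: ConnectionInfo, candidates: list[ConnectionInfo]) -> None:
        self.records[source] = list(candidates)

    def candidates_for(self, source: ConnectionInfo) -> list[ConnectionInfo]:
        return self.records.get(source, [])

    def delete(self, source: ConnectionInfo) -> None:
        self.records.pop(source, None)


# Hypothetical values for the local 91 and the remotes 93 and 94.
local_91 = ConnectionInfo(11, 0x1001, 0x10000, 4096, "gid-91", 91)
remote_93 = ConnectionInfo(21, 0x2001, 0x20000, 4096, "gid-93", 93)
remote_94 = ConnectionInfo(31, 0x3001, 0x30000, 4096, "gid-94", 94)

table = ManagementTable()
table.register(local_91, [remote_93, remote_94])  # corresponds to step S14
```

    Looking up `table.candidates_for(local_91)` then yields both remotes as candidates, and `delete` models removing the record when the connection ends.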
  • the connection information may further include information on the application that manages the connection.
  • the information acquisition unit 11 of the relay device 10 returns a registration notification of the connection information to the local 91 (step S15).
  • the local 91 receives the registration notification of the connection information and performs an RDMA transfer (step S16). That is, the R-NIC 90E of the local 91 reads the transmission data from the main memory 90B, generates a packet including this transmission data in the payload, and transmits it to the relay device 10. At this time, the R-NIC 90E includes the connection information on the local 91 side in the packet.
  • the position of the connection information in the packet is not limited to the header and can be any position.
  • The packet from the local 91 is received by the transfer destination selection unit 12 of the relay device 10.
  • The transfer destination selection unit 12 selects the transfer destination of the received packet (step S17).
  • An example of step S17 is described below.
  • The transfer destination selection unit 12 extracts the connection information of the source (local 91) from the header of the first packet, and refers to the management table (FIG. 5) based on the extracted connection information of the source.
  • The transfer destination selection unit 12 acquires from the management table the multiple pieces of connection information associated with the connection information of the source.
  • The transfer destination selection unit 12 recognizes the multiple remotes, here remotes 93 and 94, each identified by a global ID or the like contained in the acquired pieces of connection information, as destination candidates.
  • The transfer destination selection unit 12 selects one of the recognized destination candidates as the transfer destination.
  • the method of selecting the transfer destination is arbitrary.
  • the transfer destination is selected, for example, taking into consideration the communication status between the remotes 93 and 94 and the relay device 10.
  • the communication status includes the network status and the busy state of the remotes.
  • The transfer destination selection unit 12 measures the time required for communication with the remote 93 (i.e., the time from the transmission of specified data to the reception of the response) and the time required for communication with the remote 94, and selects the remote with the shorter of the two measured times as the transfer destination. Any communication may be used for this measurement. For example, the communication for establishing a connection may be used, or communication performed exclusively to grasp the communication status may be used.
  • the remote with a communication status better than a predetermined standard is selected as the transfer destination.
  • the transfer destination may be selected randomly.
  • the transfer destination selection unit 12 may count the number of times RDMA communication has been performed for each remote, and the remote with the smallest count value may be the transfer destination.
  • the transfer destination selection unit 12 may also select the remotes in order.
  • the remote may be selected by weighting each connection information in the management table and then selecting the connection information. In this way, load balancing may be performed when selecting the transfer destination.
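  • The selection methods listed above (shortest communication time, smallest transfer count, in-order selection, and weighted selection) can be sketched as small policy functions. The function names and the string identifiers for the remotes are assumptions made for illustration; the text leaves the selection method arbitrary.

```python
import random
from itertools import cycle


def select_lowest_rtt(rtts: dict[str, float]) -> str:
    """Pick the remote with the shortest measured communication time."""
    return min(rtts, key=lambda name: rtts[name])


def select_least_used(counts: dict[str, int]) -> str:
    """Pick the remote with the smallest count of RDMA transfers so far."""
    return min(counts, key=lambda name: counts[name])


def make_round_robin(remotes: list[str]):
    """Return a function that hands out the remotes in order, repeatedly."""
    it = cycle(remotes)
    return lambda: next(it)


def select_weighted(weights: dict[str, float], rng: random.Random) -> str:
    """Pick a remote at random, in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


next_remote = make_round_robin(["remote93", "remote94"])
order = [next_remote(), next_remote(), next_remote()]
```

    Any of these policies could be plugged into step S17; setting all weights equal in `select_weighted` corresponds to the random selection mentioned above.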
  • The packet output unit 13 writes the connection information of the transfer destination selected by the transfer destination selection unit 12 into the header of the packet (step S18).
  • Here, assume that the remote 93 has been determined as the transfer destination.
  • The packet output unit 13 writes into the header the connection information of the remote 93 that is registered in the management table in association with the connection information sent this time.
  • the packet output unit 13 outputs the packet after writing.
  • the output packet is RDMA transferred to the remote 93 according to the connection information (step S19).
  • the data transferred by RDMA from the local 91 may be divided into multiple packets and sent.
  • steps S16 to S19 are repeated the number of times equal to the number of packets.
  • In step S17 for the second and subsequent packets, the same destination as for the first packet is selected. This causes a series of multiple packets to be transferred to the same remote, and even to the same R-NIC. Whether or not the data is being sent in multiple packets can be determined from the data length included in the packet header, etc.
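  • Steps S17 and S18 for a series of packets can be sketched as follows. This is a simplified illustration: connection information is reduced to a string, and the `is_first` flag and the `Relay` class are assumptions standing in for the header fields and units described in the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    conn: str        # connection information carried in the packet (simplified)
    is_first: bool   # hypothetical marker for the first packet of a transfer
    payload: bytes


class Relay:
    def __init__(self, table, choose):
        self.table = table    # source conn -> list of candidate destination conns
        self.choose = choose  # selection policy: candidates -> one candidate
        self.active = {}      # source conn -> destination of the ongoing transfer

    def forward(self, packet: Packet) -> Packet:
        src = packet.conn
        if packet.is_first or src not in self.active:
            self.active[src] = self.choose(self.table[src])  # step S17
        # Step S18: write the destination's connection information into the packet.
        return replace(packet, conn=self.active[src])


picks = iter(["remote93", "remote94"])
relay = Relay({"local91": ["remote93", "remote94"]}, choose=lambda c: next(picks))
out = [
    relay.forward(Packet("local91", True, b"a")),   # first packet: selects remote93
    relay.forward(Packet("local91", False, b"b")),  # follows the first packet
    relay.forward(Packet("local91", True, b"c")),   # new transfer: selects again
]
```

    The second packet is sent to the same remote as the first even though the policy would have returned a different candidate, which is the behaviour described for divided data.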
  • the R-NIC 90E of the remote 93 references the connection information in the packet's header and performs processing for RDMA transfer.
  • the packet's payload data is stored as a WQE in the RQ of the QP (specified by (1) above) provided in the R-NIC 90E (specified by (3) above), and then transferred to the memory area of the main memory 90B (specified by (2) above).
  • the WQE is stored as a CQE in the CQ corresponding to the RQ, and the packet stored in the RQ is released.
  • the remote 93 sends a response to the local 91, which is the source of the transfer, via the relay device 10 to notify that the transfer has ended (steps S20 and S21).
  • the local 91 and the relay device 10 disconnect the connection (step S22).
  • The relay device 10 deletes the record including the connection information used this time from the management table (step S23), and instructs the remotes 93 and 94 to release the reserved memory area, etc. (steps S24 and S25).
  • both remotes 93 and 94 issue connection information, but the connection information generated by the remote that was not selected as the transfer destination is not used.
  • the connection information is information issued by each of remotes 93 and 94 when the remotes 93 and 94 receive the data to be transferred by RDMA transfer.
  • the information acquisition unit 11 acquires such connection information.
  • When the transfer destination selection unit 12 receives a packet containing the entire data to be transferred or a divided part of the data from the local 91 or 92, it selects, from the remotes 93 and 94 that are the candidate transfer destinations, the remote to which the packet is to be transferred as the transfer destination remote.
  • The packet output unit 13 includes, in the header of the packet, the connection information issued by the transfer destination remote among the multiple pieces of connection information acquired by the information acquisition unit 11, and outputs the packet with the connection information included.
  • the connection information included in the header is information necessary to realize the RDMA transfer to the transfer destination remote (more specifically, information specifying the transfer destination of the data), so the packet output from the packet output unit 13 is RDMA transferred to the transfer destination remote.
  • the data transfer destination is determined by the relay device 10, so data output from the local 91 or 92 can be RDMA transferred to one of multiple transfer destination candidates.
  • the transfer destination selection unit 12 selects the transfer destination remote by load balancing, thereby achieving load balancing of the transfer destination for RDMA transfer.
  • Because the relay device 10 is a network switch, selecting the destination places no load on the R-NIC 90E or the CPU 90A. Furthermore, the information acquisition unit 11, the transfer destination selection unit 12, and the packet output unit 13 are configured as a whole in an ASIC or an FPGA, which results in lower latency, higher throughput, and higher power efficiency than if the relay device 10 were a server computer.
  • PSN stands for Packet Sequence Number.
  • the PSN may be the same as that used in conventional RDMA.
  • the PSN is incremented each time a packet is transmitted or received.
  • the PSN of each packet becomes a consecutive number.
  • the PSN becomes a cumulative value.
  • the header of a packet of data transferred from a QP is given the PSN of each QP of the source and destination.
  • the following issues are based on the premise that data to be transferred via RDMA is divided into multiple packets and transferred.
  • the remotes 93 and 94 may count the PSN each time they receive a packet, and include the count value at the time of response in the response packet. In this case, if another RDMA transfer packet arrives at the remote before the response (ACK) corresponding to the last packet, the remote behaves as follows. The following "write req" corresponds to the above RDMA transferred packet.
  • a PSN table is provided in the storage unit 14 as shown in FIG.
  • the configuration example of the PSN table is shown in FIG. 7.
  • the local ID is information for identifying the local, and here, the same value as the code (91 or 92) attached to the local is used.
  • The state is one of "Sending", which indicates that a transfer of multiple packets is in progress; "Acked", which indicates that a response has been received; or "Idle", which indicates that no RDMA transfer is being performed.
  • the transfer destination remote ID is information for identifying the remote selected as the transfer destination, and here, the same value as the code (93 or 94) attached to the remote is used.
  • the local start PSN is the PSN of the local side when the first packet is transferred, and the local end PSN is the PSN of the local side when the last packet is transferred.
  • the remote start PSN is the PSN of the remote side when the first packet is transferred, and the remote end PSN is the PSN of the remote side when the last packet is transferred.
  • the PSN table has multiple rows of information storage areas for each local, and each is set to "Idle".
  • the destination selection unit 12 compares the PSN contained in the header of a packet that the relay device 10 receives from the local 91 or 92 with each range from the local start PSN to the local end PSN in the PSN table. If the PSN contained in the header does not fall within any of the ranges, the packet is usually the first packet of multiple packets when a single piece of data is divided and transferred. In this case, the destination selection unit 12 selects the destination on the condition that the header of the packet contains information indicating that the packet is the first, and obtains the above information based on the information in the header of the packet, etc.
  • the transfer destination selection unit 12 sets the PSN included in the header of the first packet as the local start PSN.
  • The transfer destination selection unit 12 can obtain the local end PSN by dividing the data length included in the header of the first packet by the data length of one packet and adding the obtained value to the local start PSN. Furthermore, when selecting a transfer destination, the transfer destination selection unit 12 communicates with the selected remote and obtains the current PSN from the remote. The transfer destination selection unit 12 adds 1 to the obtained PSN and sets the result as the remote start PSN. The transfer destination selection unit 12 then adds the value that was added to the local start PSN to the remote start PSN and sets the result as the remote end PSN.
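  • The PSN range calculation can be illustrated with a short worked example. The text describes adding the quotient of the data length and the per-packet data length; the sketch below assumes the offset is the packet count minus one, so that the end PSN equals the PSN of the last packet as the table defines it. That reading, the function name, and the sample numbers are assumptions, and PSN wraparound is not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PsnRanges:
    local_start: int
    local_end: int
    remote_start: int
    remote_end: int


def compute_psn_ranges(first_packet_psn: int, data_length: int,
                       packet_payload_length: int, remote_current_psn: int) -> PsnRanges:
    if data_length <= 0 or packet_payload_length <= 0:
        raise ValueError("lengths must be positive")
    packet_count = -(-data_length // packet_payload_length)  # ceiling division
    offset = packet_count - 1  # assumption: end PSN is the last packet's PSN
    remote_start = remote_current_psn + 1
    return PsnRanges(
        local_start=first_packet_psn,
        local_end=first_packet_psn + offset,
        remote_start=remote_start,
        remote_end=remote_start + offset,
    )


# 4096 bytes in 1024-byte packets is 4 packets: local PSNs 100..103.
ranges = compute_psn_ranges(100, 4096, 1024, remote_current_psn=499)
```

    With a remote whose current PSN is 499, the same four packets occupy remote PSNs 500 to 503, so both ranges have the same width.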
  • The transfer destination selection unit 12 registers each piece of acquired information in an "Idle" row. If there is no "Idle" row, the transfer destination selection unit 12 overwrites the oldest "Acked" row (for example, the one with the smallest PSN value) with the information. The state of the row at this time is set to "Sending". If there is neither an "Idle" nor an "Acked" row, the transfer destination selection unit 12 drops the packet. The transfer destination selection unit 12 also drops the packet if the packet header does not contain information indicating that the packet is the first.
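  • The row allocation rule above (use an "Idle" row, otherwise overwrite the oldest "Acked" row, otherwise drop) can be sketched as follows. The `PsnRow` fields mirror the table columns described in the text, but the class, the function name, and the use of the smallest local start PSN as "oldest" are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PsnRow:
    state: str = "Idle"   # "Idle", "Sending", or "Acked"
    remote_id: int = 0
    local_start: int = 0
    local_end: int = 0
    remote_start: int = 0
    remote_end: int = 0


def register_transfer(rows: list[PsnRow], remote_id: int, local_start: int,
                      local_end: int, remote_start: int, remote_end: int) -> Optional[PsnRow]:
    """Return the row used for the new transfer, or None if the packet must be dropped."""
    row = next((r for r in rows if r.state == "Idle"), None)
    if row is None:
        acked = [r for r in rows if r.state == "Acked"]
        if not acked:
            return None  # neither "Idle" nor "Acked": drop the packet
        row = min(acked, key=lambda r: r.local_start)  # "oldest" by smallest PSN
    row.state = "Sending"
    row.remote_id = remote_id
    row.local_start, row.local_end = local_start, local_end
    row.remote_start, row.remote_end = remote_start, remote_end
    return row


rows = [
    PsnRow(state="Sending", local_start=10),
    PsnRow(state="Acked", local_start=300),
    PsnRow(state="Acked", local_start=200),
]
first = register_transfer(rows, 93, 400, 403, 900, 903)   # reuses the row that started at 200
second = register_transfer(rows, 94, 500, 503, 700, 703)  # reuses the row that started at 300
third = register_transfer(rows, 93, 600, 603, 904, 907)   # no row left: drop
```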
  • If the PSN contained in the header falls within one of the ranges, the transfer destination selection unit 12 determines that the remote identified by the transfer destination remote ID corresponding to that range is the destination of the packet. The transfer destination selection unit 12 then adds the value obtained by subtracting the local start PSN from the PSN included in the header to the remote start PSN, and writes this sum in the packet header as the PSN of the remote side. This ensures that when the packet is sent to the remote side by the packet output unit 13, the PSN of the packet and the PSN of the remote side are consistent.
  • the destination selection unit 12 compares the PSN contained in the header of a packet that the relay device 10 receives from the remote 93 or 94 with each range from the remote start PSN to the remote end PSN in the PSN table. If the PSN contained in the header is not within any of the ranges, the destination selection unit 12 drops the packet. If the PSN is within any of the ranges, the destination selection unit 12 adds a value obtained by subtracting the remote start PSN from the PSN contained in the header to the local start PSN. The destination selection unit 12 writes this added value in the header of the packet as the local side PSN. As a result, when the packet is sent to the local side by the packet output unit 13, the PSN of the packet and the local PSN will be consistent.
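  • The PSN rewriting in both directions amounts to shifting a PSN by its offset within the matching range. The sketch below illustrates this under assumed names (`PsnRow`, `to_remote`, `to_local`); a return value of `None` stands for "no range matched", which the text handles by treating the packet as a possible first packet (local side) or dropping it (remote side).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PsnRow:
    remote_id: int
    local_start: int
    local_end: int
    remote_start: int
    remote_end: int


def to_remote(rows: list[PsnRow], psn: int) -> Optional[tuple[int, int]]:
    """Map a local-side PSN to (destination remote ID, remote-side PSN)."""
    for row in rows:
        if row.local_start <= psn <= row.local_end:
            return row.remote_id, row.remote_start + (psn - row.local_start)
    return None


def to_local(rows: list[PsnRow], psn: int) -> Optional[int]:
    """Map a remote-side PSN (e.g. in a response) back to the local-side PSN."""
    for row in rows:
        if row.remote_start <= psn <= row.remote_end:
            return row.local_start + (psn - row.remote_start)
    return None


rows = [PsnRow(remote_id=93, local_start=100, local_end=103,
               remote_start=500, remote_end=503)]
forwarded = to_remote(rows, 102)  # third packet of the transfer
returned = to_local(rows, 503)    # response to the last packet
```

    Here the local PSN 102 is rewritten to 502 for the remote 93, and the remote PSN 503 is rewritten back to 103 for the local.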
  • the correct range of PSNs (start PSN to end PSN) contained in a packet is determined for each RDMA transfer by the PSN table, so the inconveniences described in issues 1 and 2 above do not occur. Also, because each packet is compared with the PSN range, there is no problem even if a rollback occurs. Note that the number of rows in the PSN table should be, for example, at least twice as many as the maximum number of anticipated RDMA transfers.
  • a packet includes a PSN that specifies the number of packets sent by the local 91 or the like, and the packet output unit 13 outputs a packet if the PSN included in the packet falls within any of the PSN ranges (PSN table) defined for each RDMA transfer.
  • the destination selection unit 12 refers to a PSN table that indicates the correspondence between each of the ranges and the remote that is the destination, and when the PSN included in the non-leading packet falls within any of the ranges, selects the remote that corresponds to that range as the destination. This causes the non-leading packet to be transferred to the same destination as the leading packet.
  • the connection information issued by the remote may be issued in units of QP.
  • the transfer destination selection unit 12 selects the QP of the transfer destination when selecting the transfer destination remote, and the packet output unit 13 includes in the packet the connection information issued by the transfer destination remote for the QP selected by the transfer destination selection unit 12.
  • the relay device 10 functions as a switch or load balancer for the virtual environment.
  • the connection information issued by the remote may be issued in units of memory area (e.g., the key and addr in (2) above) of the main memory 90B to which the data is to be transferred.
  • the transfer destination selection unit 12 selects the memory area of the transfer destination when selecting the transfer destination remote, and the packet output unit 13 includes in the packet the connection information issued by the transfer destination remote for the memory area selected by the transfer destination selection unit 12.
  • the relay device 10 functions as a switch or load balancer for the device memory area.
  • this embodiment makes it possible to switch between applications and even load balance from outside the remote. Furthermore, load balancing ensures the fairness of the switched QPs, and the bandwidth of each application is approximately the same.
  • the selection targets by the destination selection unit 12 may be multiple QPs or memory areas within one remote.
  • the relay device 10 may configure a large-scale cluster in a fat tree together with relay devices 101 to 103 such as other switches.
  • the relay device 10 in FIG. 8 also includes the above-mentioned units 11 to 13.
  • the packet output unit 13 of the relay device 10 may detect a malfunction of the transmission path (congestion, disconnection, etc.) when transferring a packet from the local 91 or 92 to the remote 93 or 94 that is the RDMA transfer destination.
  • the path (dotted line) between the relay device 10 and the relay device 102 is malfunctioning.
  • The packet output unit 13 detects the malfunction of the path by a known method, and rewrites the header of the packet so that the packet is transmitted via another transmission path that bypasses the malfunctioning path, that is, the transmission path of the relay device 10 → relay device 101 → relay device 103 → remote 93 or 94.
  • the detour may be set in advance for each malfunctioning path, for example.
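  • A preset detour per malfunctioning path can be pictured as a lookup keyed by the failed link. In the sketch below the node names, the default route through the relay device 102, and the function `select_route` are assumptions for illustration; the actual topology is the one in the figure, and malfunction detection itself is left to a known method.

```python
Link = tuple[str, str]
Route = tuple[str, ...]


def select_route(default_route: Route, failed_links: set[Link],
                 detours: dict[Link, Route]) -> Route:
    """Return the default route, or the preset detour if one of its links has failed."""
    for link in zip(default_route, default_route[1:]):
        if link in failed_links:
            return detours[link]  # raises KeyError if no detour was preset
    return default_route


# Hypothetical default route and the detour named in the text.
default_route = ("relay10", "relay102", "remote93")
detours = {
    ("relay10", "relay102"): ("relay10", "relay101", "relay103", "remote93"),
}

healthy = select_route(default_route, set(), detours)
rerouted = select_route(default_route, {("relay10", "relay102")}, detours)
```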
  • Conventionally, congestion detection and other functions were performed by the sending side, which placed a load on the sending side.
  • In this embodiment, congestion detection and other functions are performed by the relay device (especially in the case of a switch that uses an ASIC or FPGA), which reduces the load on the sending side, reduces latency, increases throughput, and increases power efficiency. It also makes it possible to avoid congestion in RDMA connections and balance the network load.
  • the relay device 10 is a switch, but a router or the like may be used as another example of the relay device 10.
  • the hardware configuration of each of the units 11 to 14 of the relay device 10 is arbitrary, and each of the units 11 to 13 may be configured by a processor that executes a program.
  • the present invention is not limited to the above-described embodiments and modifications.
  • the present invention includes various modifications to the above-described embodiments and modifications that can be understood by a person skilled in the art within the scope of the technical concept of the present invention.
  • the configurations listed in the above-described embodiments and modifications can be combined as appropriate to the extent that there is no contradiction. It is also possible to delete any of the above-described configurations.
  • (Appendix 1) A relay device that relays data transferred by RDMA (Remote Direct Memory Access), the relay device comprising: an information acquisition unit that acquires a plurality of pieces of connection information issued by each of a plurality of remote computers when the data is to be received by the RDMA transfer; a transfer destination selection unit that, when receiving a packet including the entire data or a divided part of the data from a local computer, selects, from the plurality of remote computers, a remote computer to which the packet is to be transferred as a transfer destination computer; and a packet output unit that causes the packet to include connection information issued by the destination computer among the plurality of connection information acquired by the information acquisition unit, and outputs the packet with the connection information included therein, thereby RDMA-transferring the packet to the destination computer.
  • (Appendix 2) The relay device of claim 1, wherein the transfer destination selection unit selects, from among the plurality of remote computers, a remote computer having a communication status with the relay
  • (Appendix 3) The relay device according to claim 1 or 2, wherein the relay device is a network switch, and the information acquisition unit, the transfer destination selection unit, and the packet output unit are configured as a whole in at least one of an ASIC (Application Specific Integrated Circuit) and an FPGA (Field-Programmable Gate Array).
  • (Appendix 4) The relay device according to claim 1, wherein the transfer destination selection unit selects the transfer destination computer by load balancing.
  • (Appendix 5) The relay device according to any one of claims 1 to 4, wherein the packet includes a Packet Sequence Number (PSN) that identifies the number of packets sent by the local computer, and the packet output unit outputs a packet when the PSN contained in the packet is within any of the ranges of PSNs defined for each RDMA transfer.
  • (Appendix 6) The packet is a packet including a portion of the data that is divided and is a non-leading packet that is not a leading packet, and the transfer destination selection unit refers to a table showing a correspondence relationship between each range of the PSN and a remote computer to be a transfer destination, and when the PSN included in the non-leading packet is within any of the ranges of the PSN, selects the remote computer corresponding to the range as the transfer destination computer.
  • (Appendix 7) The relay device according to any one of claims 1 to 6, wherein each of the plurality of pieces of connection information is issued in units of a QP (Queue Pair) or a memory area of a data transfer destination, the transfer destination selection unit selects a QP or memory area to which the data is to be transferred when selecting the transfer destination computer, and the packet output unit includes, in the packet, the connection information issued by the destination computer for the QP or the memory area selected by the transfer destination selection unit.
  • (Appendix 8) The relay device according to any one of claims 1 to 7, wherein the packet output unit rewrites a header of the packet so that the packet is transmitted to the destination computer via another transmission path when the transmission path of the packet to the destination computer is out of order.
  • 10... relay device, 11... information acquisition unit, 12... transfer destination selection unit, 13... packet output unit, 14... storage unit, 90... computer, 90A... CPU, 90B... main memory, 90C... storage device, 90D... communication interface, 90E... R-NIC, 91... first local computer, 92... second local computer, 93... first remote computer, 94... second remote computer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

A relay device (10) relays data to be transferred by remote direct memory access (RDMA) and includes: an information acquisition unit that acquires a plurality of pieces of connection information, each issued by one of a plurality of remote computers (93, 94) on the assumption that it receives the data by RDMA transfer; a transfer destination selection unit that, when a packet containing the entire data or a divided portion of the data is received from a local computer (91), selects, from among the plurality of remote computers, the remote computer to which the packet is to be transferred as a transfer destination computer; and a packet output unit that includes in the packet, from among the plurality of pieces of connection information acquired by the information acquisition unit, the connection information issued by the transfer destination computer, and outputs the resulting packet, thereby performing the RDMA transfer of the packet to the transfer destination computer.

Description

Relay device
The present invention relates to a relay device, such as a network switch, that relays data transferred by RDMA (Remote Direct Memory Access).
RDMA is a communication protocol. In RDMA, a local computer issues a request to establish a connection with a remote computer. The local computer then transfers data stored in its own memory directly to the main memory of the remote computer (Non-Patent Document 1).
In conventional RDMA, the local computer and the remote computer are fixed one-to-one when the connection is established. Consequently, once data has been output from the local computer, it cannot be transferred to one of multiple candidate transfer destinations. If that were possible, data could, for example, be transferred while avoiding remote computers that are busy.
An object of the present invention is to enable data output from a local computer to be RDMA transferred to one of multiple candidate transfer destinations.
To solve the above problem, a relay device according to the present invention relays data transferred by RDMA (Remote Direct Memory Access) and includes: an information acquisition unit that acquires a plurality of pieces of connection information, each issued by one of a plurality of remote computers on the assumption that it receives the data by RDMA transfer; a transfer destination selection unit that, when a packet containing the entire data or a divided portion of the data is received from a local computer, selects, from among the plurality of remote computers, the remote computer to which the packet is to be transferred as a transfer destination computer; and a packet output unit that includes in the packet, from among the plurality of pieces of connection information acquired by the information acquisition unit, the connection information issued by the transfer destination computer, and outputs the resulting packet, thereby RDMA transferring the packet to the transfer destination computer.
According to the present invention, data output from a local computer can be RDMA transferred to one of multiple candidate transfer destinations.
FIG. 1 is a configuration diagram of a system including a relay device according to a first embodiment of the present invention.
FIG. 2 is a hardware configuration diagram of the local and remote computers in FIG. 1.
FIG. 3 is a block diagram of the relay device of FIG. 1.
FIG. 4 is a flowchart for explaining the process executed by the relay device of FIG. 1.
FIG. 5 is a diagram showing an example of the configuration of the management table.
FIG. 6 is a block diagram of a relay device according to the second embodiment.
FIG. 7 is a diagram showing an example of the configuration of a PSN table.
FIG. 8 is a block diagram of a relay device according to the third embodiment.
FIG. 9 is a block diagram of a relay device according to the third embodiment.
FIG. 10 is a configuration diagram of a system including a relay device according to the fourth embodiment.
Embodiments of the present invention and their modifications are described below with reference to the drawings.
First Embodiment
As shown in FIG. 1, the relay device 10 according to the present embodiment is communicably connected, via a network NW such as the Internet and a router or the like (not shown), to first and second local computers (also simply referred to as locals) 91 and 92, which are transfer sources in RDMA (Remote Direct Memory Access). The relay device 10 is further communicably connected, by wire or wirelessly, to first and second remote computers (also simply referred to as remotes) 93 and 94, which are transfer destinations in RDMA. The relay device 10 is configured as a network switch.
The remotes 93 and 94 are able to execute the same processing, which makes the processing redundant. The relay device 10 transmits data to be processed from the local 91 or 92 to either the remote 93 or the remote 94. The data to be processed is transmitted in a packet that contains it. If the data is long, it is divided into multiple pieces, each of which is transmitted in its own packet. The remotes 93 and 94 may be devices that perform distributed processing of big data transmitted from the locals 91 and 92. In this case, the data transmitted by the locals 91 and 92 may be sent to either the remote 93 or the remote 94.
The locals 91 and 92 and the remotes 93 and 94 in FIG. 1 are also collectively referred to as computers 90. Each of the locals 91 and 92 and the remotes 93 and 94, that is, each computer 90, has the configuration shown in FIG. 2.
As shown in FIG. 2, the computer 90 includes a CPU (Central Processing Unit) 90A, a main memory 90B, and a non-volatile storage device 90C that stores programs executed by the CPU 90A, such as an OS (Operating System) and user applications. The computer 90 further includes a communication interface 90D connected to the outside. The communication interface 90D includes, for example, an R-NIC (RDMA-Network Interface Card) 90E serving as an HCA (Host Channel Adapter) for realizing RDMA transfers. The communication interface 90D may include multiple R-NICs 90E.
The R-NIC 90E is provided with multiple QPs (Queue Pairs), each consisting of an SQ (Send Queue) and an RQ (Receive Queue), that are used when performing RDMA transfers. The unit of communication in an RDMA transfer is a communication request called a WR (Work Request), and the data to be communicated is queued in the SQ or RQ as a WQE (Work Queue Element).
The R-NIC 90E is also provided with a CQ (Completion Queue) corresponding to each SQ and RQ. When a WR between QPs completes, the WQE queued in the SQ or RQ is queued in the corresponding CQ as a CQE (Completion Queue Entry).
As shown in FIG. 3, the relay device 10 includes an information acquisition unit 11, a transfer destination selection unit 12, a packet output unit 13, and a storage unit 14. The units 11 to 14 are configured, as a whole, in at least one of an ASIC (Application Specific Integrated Circuit) and an FPGA (Field-Programmable Gate Array). In other words, all of the units 11 to 14 may be implemented in either an ASIC or an FPGA, or some of the units may be implemented in one of the two and the remainder in the other.
The operation of the units 11 to 14 of the relay device 10 is described below with reference to FIG. 4, together with the flow of an RDMA transfer. The description assumes that the RDMA transfer writes data prepared in the main memory 90B of the local 91 into the main memory 90B of either the remote 93 or the remote 94.
As shown in FIG. 4, the CPU 90A of the local 91 first issues a request to establish a connection for RDMA transfer and establishes a connection with the relay device 10 (step S11). The information acquisition unit 11 of the relay device 10 then establishes a connection with each of the remotes 93 and 94 (steps S12 and S13).
When a connection is established, each CPU 90A of the local 91 and the remotes 93 and 94 determines the memory area in the main memory 90B to be used for the RDMA transfer (on the local side, the memory area that holds the data to be sent; on the remote side, the memory area that will hold the received data). Each CPU 90A also determines which of the multiple QPs of the R-NIC 90E to use. Each CPU 90A generates connection information containing these decisions and transmits it to the relay device 10. The transmitted connection information is acquired by the information acquisition unit 11 of the relay device 10.
The connection information includes, for example, the following information:
(1) QP number: a number that identifies the QP used in this RDMA transfer.
(2) Information about the memory area used in this RDMA transfer.
 ・key: a key for accessing the memory area (address) of the main memory in which the RDMA-transferred data is stored or from which it is read.
 ・addr: the address of the registered memory area.
 ・buffer size: the size of the registered memory area.
(3) The global ID and local ID of the R-NIC used for RDMA, and the global ID and local ID of the computer having this R-NIC.
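The items (1) to (3) listed above can be pictured as a small record. The following is only an illustrative sketch: the class and field names (`ConnectionInfo`, `MemoryRegion`, `nic_global_id`, and so on), the field types, and the sample values are assumptions introduced here and do not appear in the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRegion:
    """Item (2): the memory area used in this RDMA transfer (hypothetical layout)."""
    key: int          # key for accessing the registered memory area
    addr: int         # address of the registered memory area
    buffer_size: int  # size of the registered memory area, assumed to be in bytes


@dataclass(frozen=True)
class ConnectionInfo:
    """Connection information issued by one computer for one RDMA transfer."""
    qp_number: int        # item (1): QP used in this RDMA transfer
    memory: MemoryRegion  # item (2)
    nic_global_id: str    # item (3): global ID of the R-NIC
    nic_local_id: int     # item (3): local ID of the R-NIC
    host_global_id: str   # item (3): global ID of the computer
    host_local_id: int    # item (3): local ID of the computer


# Sample values are made up purely for illustration.
remote_93 = ConnectionInfo(
    qp_number=0x11,
    memory=MemoryRegion(key=0xABCD, addr=0x1000, buffer_size=4096),
    nic_global_id="nic-93",
    nic_local_id=1,
    host_global_id="host-93",
    host_local_id=93,
)
```

Making the records immutable (`frozen=True`) is a design choice of this sketch; it lets a record be used as a dictionary key when connection information is associated with other connection information.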
The information acquisition unit 11 of the relay device 10 registers the acquired connection information of the local 91, the remote 93, and the remote 94 in the management table (see FIG. 5) of the storage unit 14 in association with one another (step S14). In this embodiment, data to be RDMA transferred is transferred to one of multiple transfer destination candidates (here, the remotes 93 and 94). The management table is provided to manage these candidates. As shown in FIG. 5, for each unit of RDMA transfer, the management table holds the connection information of the transfer source (such as the local 91) in association with the connection information of each transfer destination candidate (the remotes 93 and 94). Information on the application that manages the connection may additionally be registered in the connection information.
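As a rough illustration of the management table, the sketch below keeps one record per transfer source and maps it to the connection information of its candidates. The `ManagementTable` class, its method names, and the simplified `Conn` tuple are hypothetical; the actual table layout is the one shown in FIG. 5, which this sketch does not claim to reproduce.

```python
from collections import namedtuple

# Simplified stand-in for connection information (assumed fields only).
Conn = namedtuple("Conn", ["host_id", "qp_number"])


class ManagementTable:
    """Maps the source's connection information to its destination candidates."""

    def __init__(self):
        self._records = {}

    def register(self, source, candidates):
        # Step S14: associate the source with every candidate's connection info.
        self._records[source] = list(candidates)

    def candidates_for(self, source):
        # Used at step S17 to look up the transfer destination candidates.
        return list(self._records.get(source, []))

    def delete(self, source):
        # Step S23: drop the record once the connection is torn down.
        self._records.pop(source, None)


table = ManagementTable()
local_91 = Conn(host_id=91, qp_number=5)
table.register(local_91, [Conn(93, 7), Conn(94, 9)])
```

A lookup such as `table.candidates_for(local_91)` then returns the two candidate records in registration order.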
The information acquisition unit 11 of the relay device 10 returns a notification that the connection information has been registered to the local 91 (step S15).
On receiving the registration notification, the local 91 performs the RDMA transfer (step S16). That is, the R-NIC 90E of the local 91 reads the data to be sent from the main memory 90B, generates a packet whose payload contains this data, and transmits the packet to the relay device 10. At this time, the R-NIC 90E includes the connection information of the local 91 side in the packet. The connection information may be placed anywhere in the packet, not only in the header.
The packet from the local 91 is received by the transfer destination selection unit 12 of the relay device 10. The transfer destination selection unit 12 selects the transfer destination of the received packet (step S17).
An example of step S17 is as follows. First, the transfer destination selection unit 12 extracts the connection information of the transfer source (the local 91) from the header of the first packet and refers to the management table (FIG. 5) using the extracted connection information. The transfer destination selection unit 12 acquires from the management table the multiple pieces of connection information associated with the connection information of the transfer source. The transfer destination selection unit 12 recognizes the multiple remotes identified by, for example, the global IDs contained in the acquired connection information, here the remotes 93 and 94, as transfer destination candidates. The transfer destination selection unit 12 then selects one of the recognized candidates as the transfer destination.
Any method may be used to select the transfer destination. The transfer destination is selected, for example, in consideration of the communication status between the relay device 10 and each of the remotes 93 and 94. The communication status includes the state of the network and whether a remote is busy. For example, the transfer destination selection unit 12 measures the time required for communication with the remote 93 (that is, the time from transmitting predetermined data to receiving the response) and the time required for communication with the remote 94, and selects the remote with the shorter measured time as the transfer destination. Any communication may be used for this measurement: it may be the communication for establishing the connection, or a communication performed solely to determine the communication status. In this way, a remote whose communication status is better than a predetermined criterion is preferably selected as the transfer destination. As another example, the transfer destination may be selected at random. As another example, the transfer destination selection unit 12 may count the number of RDMA communications performed with each remote and select the remote with the smallest count. The transfer destination selection unit 12 may also select the remotes in turn. The remote may also be selected by weighting each piece of connection information in the management table and then selecting a piece of connection information. As these examples show, load balancing may be performed when selecting the transfer destination.
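Three of the selection policies mentioned above (shortest measured communication time, smallest transfer count, and selecting the remotes in turn) can be sketched as follows. The function names and the dictionary-based inputs are assumptions made for illustration; the specification leaves the selection method open and describes a hardware (ASIC or FPGA) implementation rather than software.

```python
import itertools


def select_lowest_latency(latency_by_remote):
    """Pick the remote with the shortest measured request-to-response time."""
    return min(latency_by_remote, key=latency_by_remote.get)


def select_least_used(transfer_count_by_remote):
    """Pick the remote that has handled the fewest RDMA communications so far."""
    return min(transfer_count_by_remote, key=transfer_count_by_remote.get)


def round_robin(remotes):
    """Return an iterator that yields the remotes in turn, indefinitely."""
    return itertools.cycle(remotes)


# Example inputs (made-up measurements, keyed by the remote's reference numeral).
fastest = select_lowest_latency({93: 0.8, 94: 0.3})
least_used = select_least_used({93: 2, 94: 5})
turns = round_robin([93, 94])
```

With these inputs, `fastest` is the remote 94 and `least_used` is the remote 93; successive `next(turns)` calls alternate between 93 and 94.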
After step S17, the packet output unit 13 writes the connection information of the transfer destination selected by the transfer destination selection unit 12 into the header of the packet (step S18). Here, it is assumed that the remote 93 has been selected as the transfer destination. That is, the packet output unit 13 writes into the header the connection information of the remote 93 that is registered in the management table in association with the connection information received this time.
The packet output unit 13 then outputs the rewritten packet. The output packet is RDMA transferred to the remote 93 in accordance with the connection information (step S19).
The data that the local 91 transfers by RDMA may be divided into multiple packets. In this case, steps S16 to S19 are repeated once per packet. In step S17, however, the same transfer destination as for the first packet is selected. As a result, the whole series of packets is transferred to the same remote and, moreover, to the same R-NIC. Whether the data is being sent in multiple packets can be determined from, for example, the data length contained in the packet header.
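Steps S17 to S19, including the rule that every packet of one transfer goes to the same destination as the first packet, might be sketched as below. This is a simplified software model under stated assumptions: packets are plain dictionaries with hypothetical `conn` and `payload` fields, the destination is remembered per source connection for the lifetime of the generator, and `relay_packets` and `choose` are names introduced here.

```python
def relay_packets(packets, candidates, choose):
    """Yield rewritten packets; all packets from one source share a destination."""
    destination_by_source = {}
    for packet in packets:
        source = packet["conn"]
        if source not in destination_by_source:
            # Step S17, first packet only: pick one of the registered candidates.
            destination_by_source[source] = choose(candidates[source])
        # Step S18: replace the source's connection info with the destination's.
        rewritten = dict(packet)
        rewritten["conn"] = destination_by_source[source]
        # Step S19: output the rewritten packet.
        yield rewritten


candidates = {"local91-qp5": ["remote93-qp7", "remote94-qp9"]}
packets = [
    {"conn": "local91-qp5", "payload": b"part1"},
    {"conn": "local91-qp5", "payload": b"part2"},
]
out = list(relay_packets(packets, candidates, choose=lambda cands: cands[0]))
```

Here both output packets carry the connection information of the remote 93, while the input packets are left unmodified because each one is copied before rewriting.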
The R-NIC 90E of the remote 93 refers to the connection information in the packet header and performs the processing for the RDMA transfer. As a result, the payload data of the packet is stored as a WQE in the RQ of the QP (identified by (1) above) provided in the R-NIC 90E (identified by (3) above), and is then transferred to the memory area of the main memory 90B (identified by (2) above). After the transfer, the WQE is stored as a CQE in the CQ corresponding to the RQ, and the packet stored in the RQ is released.
The remote 93 then sends a response notifying that the transfer has finished to the local 91, the transfer source, via the relay device 10 (steps S20 and S21).
The local 91 and the relay device 10 then disconnect (step S22). In response to this disconnection, the relay device 10 deletes the record containing the connection information used this time from the management table (step S23) and instructs the remotes 93 and 94 to release the reserved memory areas and the like (steps S24 and S25).
The same operations are performed for the local 92.
In this embodiment, for the RDMA transfer of one piece of data, both the remote 93 and the remote 94 issue connection information, but the connection information generated by the remote that is not selected as the transfer destination is not used. In other words, the connection information is information that each of the remotes 93 and 94 issues on the assumption that it will receive the data to be transferred by RDMA transfer. The information acquisition unit 11 acquires this connection information. When the transfer destination selection unit 12 receives from the local 91 or 92 a packet containing the entire data to be transferred or a divided portion of it, it selects, from the remotes 93 and 94 serving as candidates, the remote to which the packet is to be transferred as the transfer destination remote. The packet output unit 13 then includes in the packet header, from among the multiple pieces of connection information acquired by the information acquisition unit 11, the connection information issued by the transfer destination remote, and outputs the resulting packet. Because the connection information included in the header is the information needed to realize the RDMA transfer to the transfer destination remote (more specifically, information that specifies where the data is to be transferred), the packet output from the packet output unit 13 is RDMA transferred to the transfer destination remote. With this configuration, the relay device 10 determines the transfer destination of the data, so data output from the local 91 or 92 can be RDMA transferred to one of multiple transfer destination candidates.
In particular, when a remote whose communication status with the relay device 10 is better than a predetermined criterion (for example, the best) is selected as the transfer destination, data can be transferred while avoiding, for example, a busy remote among the multiple remotes. This yields low latency and high throughput. In addition, when the transfer destination selection unit 12 selects the transfer destination remote by load balancing, the load across RDMA transfer destinations is balanced.
Furthermore, because the relay device 10 is a network switch, selecting the transfer destination places no load on the R-NIC 90E or the CPU 90A. Moreover, because the information acquisition unit 11, the transfer destination selection unit 12, and the packet output unit 13 are configured, as a whole, in at least one of an ASIC and an FPGA, latency is lower and throughput and power efficiency are higher than if the relay device 10 were a server computer.
Second Embodiment
In each of the locals 91 and 92 and the remotes 93 and 94, a PSN (Packet Sequence Number) may be managed for each QP. The PSN may be the one used in conventional RDMA. The PSN is incremented each time a packet is transmitted or received. When one piece of data is transferred in multiple packets, the PSNs of those packets are consecutive. When multiple RDMA transfers, each for a different piece of data, have been performed, the PSN is the running total across them. The header of a data packet transferred from a QP carries the PSNs of the source QP and the destination QP.
In this embodiment, the relay device 10 also manages the PSN of each QP of the locals 91 and 92 and the remotes 93 and 94. In this case, with one local and multiple remotes as transfer destination candidates, the following relation holds: the increase in the local PSN equals the sum of the increases in the PSNs of the multiple remotes. However, the problems described below arise. These problems assume that the data to be RDMA transferred is divided into multiple packets.
(Problem 1)
Consider the case where the relay device 10 manages the PSNs of the locals 91 and 92 and the remotes 93 and 94 by incrementing them each time a packet arrives. Suppose that the local 91 sends packets of two different RDMA transfers, one to the remote 93 and the other to the remote 94, and that each response packet is returned to the relay device 10. If the local 91 side PSN counted by the relay device 10 is incremented whenever a response arrives, then when the responses reach the relay device 10 at about the same time, the local 91 side PSNs of the two RDMA transfers are merged into a single count, and the PSN no longer has the correct value. As a result, the PSN contained in the header of the next packet may not match the PSN managed by the relay device 10, causing an error.
(Problem 2)
The remotes 93 and 94 may each count the PSN every time they receive a packet and include the count at the time of the response in the response packet. In this case, if a packet of another RDMA transfer reaches a remote before the response (ACK) to the last packet is sent, the remote behaves as follows. The "write req" below corresponds to a packet transferred by RDMA as described above.
 write req #1 first (PSN=k)
 write req #1 middle (PSN=k+1)
 ...
 write req #1 middle (PSN=k+n-2)
 write req #1 last (PSN=k+n-1): This is the last packet.
 write req #2 first (PSN=k+n)
 write req #2 middle (PSN=k+n+1)
 ack #1 (PSN=k+n+1)
The local that receives the response expects PSN=k+n-1, but the response actually carries PSN=k+n+1, so the local detects a transfer error.
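The mismatch in Problem 2 is simple arithmetic and can be checked with concrete numbers. The sketch below assumes k=100 and n=4 purely as an example, and `ack_psn_seen_by_local` is a hypothetical helper that models a remote reporting its running packet count in the ACK.

```python
def ack_psn_seen_by_local(k, n, extra_packets_before_ack):
    """PSN carried by ack #1 when the remote reports its running count.

    k is the PSN of the first packet of write req #1 and n is its packet count.
    extra_packets_before_ack is how many packets of write req #2 arrive at the
    remote before ack #1 is sent.
    """
    last_psn_of_request_1 = k + n - 1
    return last_psn_of_request_1 + extra_packets_before_ack


k, n = 100, 4                 # example values, not from the specification
expected = k + n - 1          # what the local expects for ack #1
actual = ack_psn_seen_by_local(k, n, extra_packets_before_ack=2)
mismatch = actual - expected  # non-zero means the local reports a transfer error
```

With two packets of write req #2 arriving early, as in the sequence above, the ACK carries k+n+1 rather than k+n-1, so the difference is 2.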
(Problem 3)
When a packet is retransmitted, the PSN is rolled back in the local 91 and the like. If the relay device 10 manages the PSNs of the locals 91 and 92 and the remotes 93 and 94 by incrementing them each time a packet arrives, a rollback breaks the above relation (the increase in the local PSN equals the sum of the increases in the remote PSNs), and packets can no longer be transferred normally.
(Solution)
To solve the above problems, in this embodiment a PSN table is provided in the storage unit 14, as shown in FIG. 6.
FIG. 7 shows an example of the configuration of the PSN table. In FIG. 7, the local ID is information that identifies the local; here, the same value as the reference numeral assigned to the local (91 or 92) is used. The state can take the values "Sending", indicating that a transfer of data in multiple packets is in progress; "Acked", indicating that the response has been received; and "Idle", indicating that no RDMA transfer is in progress. The transfer destination remote ID is information that identifies the remote selected as the transfer destination; here, the same value as the reference numeral assigned to the remote (93 or 94) is used. The local start PSN is the local-side PSN when the first packet is transferred, and the local end PSN is the local-side PSN when the last packet is transferred. The remote start PSN is the remote-side PSN when the first packet is transferred, and the remote end PSN is the remote-side PSN when the last packet is transferred. In its initial state, the PSN table has multiple rows of storage prepared for each local, each of which is set to "Idle".
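The columns of the PSN table described above could be modeled as follows. This is a sketch only: the names `State`, `PsnRow`, and `initial_psn_table`, the use of `None` for columns not yet filled in, and the choice of two rows per local are assumptions; FIG. 7 is the authoritative layout.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    IDLE = "Idle"        # no RDMA transfer in progress
    SENDING = "Sending"  # a multi-packet transfer is in progress
    ACKED = "Acked"      # the response has been received


@dataclass
class PsnRow:
    local_id: int
    state: State = State.IDLE
    remote_id: Optional[int] = None         # transfer destination remote ID
    local_start_psn: Optional[int] = None   # local PSN at the first packet
    local_end_psn: Optional[int] = None     # local PSN at the last packet
    remote_start_psn: Optional[int] = None  # remote PSN at the first packet
    remote_end_psn: Optional[int] = None    # remote PSN at the last packet


def initial_psn_table(local_ids, rows_per_local=2):
    """Prepare several Idle rows per local, as in the initial state."""
    return [PsnRow(local_id) for local_id in local_ids for _ in range(rows_per_local)]


psn_table = initial_psn_table([91, 92])
```

Keeping more than one row per local allows several RDMA transfers from the same local to be tracked at the same time, each with its own PSN range.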
The transfer destination selection unit 12 compares the PSN contained in the header of a packet that the relay device 10 receives from the local 91 or 92 with each range from a local start PSN to a local end PSN in the PSN table. If the PSN in the header falls within none of the ranges, the packet is normally the first of the multiple packets used to transfer one piece of data in segments. In this case, provided that the packet header contains information indicating that the packet is the first packet, the transfer destination selection unit 12 selects the transfer destination as described above and acquires the items of information described above from, for example, the information in the packet header.
The transfer destination selection unit 12 sets the PSN contained in the header of the first packet as the local start PSN. It obtains the local end PSN by dividing the data length contained in the header of the first packet by the data length of one packet and adding the resulting value to the local start PSN. In addition, when selecting a transfer destination, for example, the transfer destination selection unit 12 communicates with the selected remote and acquires the current PSN from that remote. It sets the acquired PSN plus 1 as the remote start PSN. Finally, it adds the same value that was added to the local start PSN to the remote start PSN and sets the sum as the remote end PSN.
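The arithmetic in this paragraph can be sketched as follows. The function name `psn_ranges` and its parameter names are hypothetical. Two points are assumptions of this sketch rather than statements of the embodiment: the division is rounded up so that a data length that is not a multiple of the per-packet length still gets a PSN for its final packet, and the last sum is treated as the remote end PSN. PSN wraparound is ignored for brevity.

```python
def psn_ranges(first_psn, data_len, payload_per_packet, remote_current_psn):
    """Return (local_start, local_end, remote_start, remote_end) for one transfer."""
    # Data length divided by the per-packet data length (rounded up: assumption).
    span = -(-data_len // payload_per_packet)
    local_start = first_psn                 # PSN in the first packet's header
    local_end = local_start + span
    remote_start = remote_current_psn + 1   # current PSN reported by the remote, plus 1
    remote_end = remote_start + span        # same value added on the remote side
    return local_start, local_end, remote_start, remote_end


# Hypothetical numbers: 4096 bytes sent in 1024-byte packets.
print(psn_ranges(100, 4096, 1024, 499))
```

Because the same value is added on both sides, the local range and the remote range always have equal width, which is what makes the later offset-based PSN rewriting possible.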
The transfer destination selection unit 12 registers the acquired information in a row whose state is "Idle". If there is no "Idle" row, it overwrites the oldest "Acked" row (for example, the one with the smallest PSN value) with the information. The state of the registered row is set to "Sending". If there is neither an "Idle" row nor an "Acked" row, the transfer destination selection unit 12 drops the packet. It also drops the packet if the packet header does not contain information indicating that the packet is the first packet.
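The registration rule above can be sketched as a short function. Rows are plain dictionaries here, and `allocate_row` and the key names are hypothetical. "Oldest" is approximated by the smallest local start PSN, following the example given in the text.

```python
def allocate_row(rows, is_first_packet, entry):
    """Register `entry` in the table; return the row used, or None to drop the packet."""
    if not is_first_packet:
        return None  # header does not mark the packet as first: drop
    idle = [r for r in rows if r["state"] == "Idle"]
    if idle:
        row = idle[0]
    else:
        acked = [r for r in rows if r["state"] == "Acked"]
        if not acked:
            return None  # neither Idle nor Acked rows remain: drop
        # Oldest Acked row, approximated by the smallest start PSN.
        row = min(acked, key=lambda r: r["local_start_psn"])
    row.update(entry, state="Sending")
    return row


rows = [
    {"state": "Sending", "local_start_psn": 10},
    {"state": "Acked", "local_start_psn": 300},
    {"state": "Acked", "local_start_psn": 200},
]
used = allocate_row(rows, True, {"local_start_psn": 400, "remote_id": 93})
```

In this example there is no "Idle" row, so the "Acked" row that started at PSN 200 is overwritten and marked "Sending".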
If the PSN contained in the packet header falls within one of the ranges from a local start PSN to a local end PSN, the transfer destination selection unit 12 determines the remote identified by the transfer destination remote ID of that range to be the transfer destination of the packet. It then subtracts the local start PSN from the PSN in the header, adds the difference to the remote start PSN, and writes the result into the packet header as the remote-side PSN. As a result, when the packet output unit 13 sends the packet to the remote side, the PSN of the packet is consistent with the PSN of the remote.
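The local-to-remote rewrite is an offset calculation within the matching range. A minimal sketch, assuming rows are dictionaries with the hypothetical keys shown and ignoring PSN wraparound:

```python
def to_remote(psn, rows):
    """Map a local-side PSN to (remote_id, remote-side PSN), or None if no range matches."""
    for row in rows:
        if row["local_start"] <= psn <= row["local_end"]:
            offset = psn - row["local_start"]
            return row["remote_id"], row["remote_start"] + offset
    return None  # no range matched: handled as a candidate first packet


# Hypothetical row: local PSNs 100..104 map to remote 93 starting at PSN 500.
rows = [{"remote_id": 93, "local_start": 100, "local_end": 104, "remote_start": 500}]
print(to_remote(102, rows))  # a packet in the middle of the transfer
print(to_remote(101, rows))  # a retransmission after a rollback maps the same way
```

The result depends only on the PSN carried in the packet and the stored range, not on how many packets the relay has already seen, so a retransmitted packet with a rolled-back PSN is rewritten to the same remote-side PSN as the original.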
The transfer destination selection unit 12 compares the PSN contained in the header of a packet that the relay device 10 receives from the remote 93 or 94 with each range from a remote start PSN to a remote end PSN in the PSN table. If the PSN in the header falls within none of the ranges, the transfer destination selection unit 12 drops the packet. If the PSN falls within one of the ranges, it subtracts the remote start PSN from the PSN in the header, adds the difference to the local start PSN, and writes the result into the packet header as the local-side PSN. As a result, when the packet output unit 13 sends the packet to the local side, the PSN of the packet is consistent with the PSN of the local.
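The reverse direction mirrors the forward rewrite. In the sketch below, `to_local` and the dictionary keys are hypothetical. Matching on the sending remote's ID in addition to the PSN range is an assumption of this sketch: the text only describes the range comparison, but two remotes could otherwise hold overlapping PSN ranges. Wraparound is again ignored.

```python
def to_local(remote_id, psn, rows):
    """Map a remote-side PSN to (local_id, local-side PSN), or None to drop the packet."""
    for row in rows:
        if row["remote_id"] != remote_id:
            continue  # assumption: only rows for the sending remote are considered
        if row["remote_start"] <= psn <= row["remote_end"]:
            offset = psn - row["remote_start"]
            return row["local_id"], row["local_start"] + offset
    return None  # PSN is outside every range: drop


rows = [
    {"local_id": 91, "remote_id": 93, "local_start": 100, "remote_start": 500, "remote_end": 504},
    {"local_id": 92, "remote_id": 94, "local_start": 40, "remote_start": 500, "remote_end": 502},
]
print(to_local(93, 503, rows))
print(to_local(94, 501, rows))
```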
As described above, in this embodiment the PSN table defines, for each RDMA transfer, the correct range of PSNs (start PSN to end PSN) that a packet may contain, so the inconveniences described in problems 1 and 2 above do not arise. In addition, because each packet is compared against PSN ranges, a rollback causes no problem. The PSN table should preferably have, for example, at least twice as many rows as the maximum number of RDMA transfers anticipated.
As described above, a packet includes a PSN that specifies the number of packets sent by the local 91 or the like, and the packet output unit 13 outputs the packet when the PSN it contains falls within one of the PSN ranges defined for each RDMA transfer (the PSN table). As a result, packets containing correct PSNs are output, and rollbacks and the like are also handled.
Furthermore, for a non-first packet, that is, a packet that contains a segment of the data to be transferred and is not the first packet, the transfer destination selection unit 12 refers to the PSN table, which indicates the correspondence between each of the ranges and the remote serving as the transfer destination. When the PSN contained in the non-first packet falls within one of the ranges, the transfer destination selection unit 12 selects the remote corresponding to that range as the transfer destination. As a result, the non-first packet is transferred to the same destination as the first packet.
(Third Embodiment)
As shown in the management table of FIG. 8, the connection information issued by a remote may be issued per QP. In this case, the transfer destination selection unit 12 selects not only the transfer destination remote but also the transfer destination QP, and the packet output unit 13 includes in the packet the connection information that the transfer destination remote issued for the QP selected by the transfer destination selection unit 12. When QPs are associated with virtual environments (containers or virtual machines), the relay device 10 functions as a switch or load balancer for the virtual environments.
Also, as shown in the management table of FIG. 9, the connection information issued by a remote may be issued per memory area of the main memory 90B to which the data is transferred (for example, per key and addr in (2) above). In this case, the transfer destination selection unit 12 selects not only the transfer destination remote but also the transfer destination memory area, and the packet output unit 13 includes in the packet the connection information that the transfer destination remote issued for the memory area selected by the transfer destination selection unit 12. When a technology that makes device memory the RDMA access target is used, the relay device 10 functions as a switch or load balancer for device memory areas.
Because QPs and the like are set up for each application that executes RDMA, this embodiment makes per-application switching, and even load balancing, possible from outside the remote. Load balancing also keeps the switched QPs fair, so that each application receives roughly the same bandwidth. When this embodiment is the main focus, the targets selected by the transfer destination selection unit 12 may be multiple QPs or memory areas within a single remote.
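Per-QP selection can be sketched as choosing one entry from a management table keyed by remote and QP. The embodiment does not specify the load-balancing policy, so the least-loaded rule below is an assumption, as are the function name `select_target`, the field names `qpn`, `rkey`, and `addr`, and all values in the example table.

```python
def select_target(entries, bytes_in_flight):
    """Pick the (remote, QP) entry with the least outstanding traffic (assumed policy)."""
    return min(
        entries,
        key=lambda e: bytes_in_flight.get((e["remote_id"], e["qpn"]), 0),
    )


# Hypothetical management table: connection information issued per QP.
management_table = [
    {"remote_id": 93, "qpn": 11, "rkey": 0x1A, "addr": 0x1000},
    {"remote_id": 93, "qpn": 12, "rkey": 0x1B, "addr": 0x2000},
    {"remote_id": 94, "qpn": 21, "rkey": 0x2A, "addr": 0x3000},
]
load = {(93, 11): 8192, (93, 12): 1024, (94, 21): 4096}

target = select_target(management_table, load)
print(target["remote_id"], target["qpn"])
```

Here QP 12 on remote 93 carries the least traffic and is selected; its connection information is what the packet output unit would place in the packet. Entries with no recorded load are treated as idle and chosen first.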
(Fourth Embodiment)
As shown in FIG. 8, the relay device 10 may form a large-scale cluster in a fat-tree topology together with relay devices 101 to 103, such as other switches. The relay device 10 in FIG. 8 also includes the units 11 to 13 described above. In such a configuration, the packet output unit 13 of the relay device 10 may detect a fault in the transmission path (congestion, disconnection, or the like) when transferring a packet from the local 91 or 92 to the remote 93 or 94 that is the RDMA transfer destination. In FIG. 8, the path (dotted line) between the relay device 10 and the relay device 102 is faulty. In such a case, the packet output unit 13 detects the path fault by a known method and rewrites the packet header so that the packet is sent over another transmission path that bypasses the faulty path, namely the path relay device 10 → relay device 101 → relay device 103 → remote 93 or 94. The detour may, for example, be set in advance for each path that can become faulty.
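The preconfigured detour lookup can be sketched as a table keyed by the faulty link. The function name `choose_path`, the list-of-node-IDs path representation, and the example tables are assumptions for illustration. Fault detection itself is left out, since the text defers it to a known method.

```python
def choose_path(default_path, failed_links, detours):
    """Return the path to use: the default, a preconfigured detour, or None if none exists."""
    destination = default_path[-1]
    for link in zip(default_path, default_path[1:]):
        if link in failed_links:
            relay_hops = detours.get(link)
            if relay_hops is None:
                return None  # no detour configured for this link
            return relay_hops + [destination]
    return default_path


# Hypothetical configuration mirroring the example: link 10-102 is faulty.
detours = {(10, 102): [10, 101, 103]}
failed = {(10, 102)}

print(choose_path([10, 102, 93], failed, detours))
print(choose_path([10, 102, 94], set(), detours))
```

Storing only the relay hops in the detour table lets the same entry serve both remotes, since the destination is appended from the original path.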
Conventionally, congestion detection and similar functions were performed by the sending side, which placed a load on the sender. In this embodiment, the relay device performs congestion detection and similar functions (particularly in the case of a switch that uses an ASIC or FPGA), which allows them to be carried out with reduced load, low latency, high throughput, and high power efficiency. It also enables congestion avoidance for RDMA connections and network load balancing.
(Modification)
In the above embodiments, the relay device 10 is a switch, but a router or the like may be used instead. The hardware configuration of the units 11 to 14 of the relay device 10 is arbitrary, and the units 11 to 13 may be implemented by a processor that executes a program.
The present invention is not limited to the above embodiments and modifications. For example, the present invention includes various changes to the above embodiments and modifications that a person skilled in the art can understand within the scope of the technical concept of the present invention. The configurations described in the above embodiments and modifications can be combined as appropriate to the extent that no contradiction arises. Any of the above configurations may also be omitted.
(Appendix)
Configurations exemplified by the above embodiments and modifications are listed below.
(Appendix 1)
A relay device that relays data transferred by RDMA (Remote Direct Memory Access), comprising:
an information acquisition unit that acquires a plurality of pieces of connection information, each issued by a respective one of a plurality of remote computers for receiving the data by RDMA transfer;
a transfer destination selection unit that, upon receiving from a local computer a packet containing all of the data or a segment of the data, selects, from among the plurality of remote computers, the remote computer to which the packet is to be transferred as a transfer destination computer; and
a packet output unit that includes in the packet, from among the plurality of pieces of connection information acquired by the information acquisition unit, the connection information issued by the transfer destination computer, and outputs the packet, thereby transferring the packet to the transfer destination computer by RDMA.
(Appendix 2)
The relay device according to Appendix 1, wherein
the transfer destination selection unit selects, as the transfer destination computer, a remote computer among the plurality of remote computers whose communication status with the relay device is better than a predetermined standard.
(Appendix 3)
The relay device according to Appendix 1 or 2, wherein
the relay device is a network switch, and
the information acquisition unit, the transfer destination selection unit, and the packet output unit are, as a whole, implemented in at least one of an ASIC (Application Specific Integrated Circuit) and an FPGA (Field-Programmable Gate Array).
(Appendix 4)
The relay device according to any one of Appendices 1 to 3, wherein
the transfer destination selection unit selects the transfer destination computer by load balancing.
(Appendix 5)
The relay device according to any one of Appendices 1 to 4, wherein
the packet includes a PSN (Packet Sequence Number) that specifies the number of packets sent by the local computer, and
the packet output unit outputs the packet when the PSN included in the packet falls within one of the PSN ranges defined for each RDMA transfer.
(Appendix 6)
The relay device according to Appendix 5, wherein
the packet is a non-first packet, that is, a packet that contains a segment of the data and is not the first packet, and
the transfer destination selection unit refers to a table indicating the correspondence between each of the PSN ranges and the remote computer serving as the transfer destination and, when the PSN included in the non-first packet falls within one of the PSN ranges, selects the remote computer corresponding to that range as the transfer destination computer.
(Appendix 7)
The relay device according to any one of Appendices 1 to 6, wherein
each of the plurality of pieces of connection information is issued per QP (Queue Pair) or per memory area to which data is transferred,
the transfer destination selection unit, when selecting the transfer destination computer, also selects the QP or memory area to which the data is transferred, and
the packet output unit includes in the packet the connection information that the transfer destination computer issued for the QP or memory area selected by the transfer destination selection unit.
(Appendix 8)
The relay device according to any one of Appendices 1 to 7, wherein
the packet output unit, when the transmission path of the packet to the transfer destination computer is faulty, rewrites the header of the packet so that the packet is sent to the transfer destination computer over another transmission path.
10... relay device, 11... information acquisition unit, 12... transfer destination selection unit, 13... packet output unit, 14... storage unit, 90... computer, 90A... CPU, 90B... main memory, 90C... storage device, 90D... communication interface, 90E... R-NIC, 91... first local computer, 92... second local computer, 93... first remote computer, 94... second remote computer.

Claims (8)

  1.  A relay device that relays data transferred by RDMA (Remote Direct Memory Access), comprising:
    an information acquisition unit that acquires a plurality of pieces of connection information, each issued by a respective one of a plurality of remote computers for receiving the data by RDMA transfer;
    a transfer destination selection unit that, upon receiving from a local computer a packet containing all of the data or a segment of the data, selects, from among the plurality of remote computers, the remote computer to which the packet is to be transferred as a transfer destination computer; and
    a packet output unit that includes in the packet, from among the plurality of pieces of connection information acquired by the information acquisition unit, the connection information issued by the transfer destination computer, and outputs the packet, thereby transferring the packet to the transfer destination computer by RDMA.
  2.  The relay device according to claim 1, wherein
    the transfer destination selection unit selects, as the transfer destination computer, a remote computer among the plurality of remote computers whose communication status with the relay device is better than a predetermined standard.
  3.  The relay device according to claim 1, wherein
    the relay device is a network switch, and
    the information acquisition unit, the transfer destination selection unit, and the packet output unit are, as a whole, implemented in at least one of an ASIC (Application Specific Integrated Circuit) and an FPGA (Field-Programmable Gate Array).
  4.  The relay device according to claim 1, wherein
    the transfer destination selection unit selects the transfer destination computer by load balancing.
  5.  The relay device according to claim 1, wherein
    the packet includes a PSN (Packet Sequence Number) that specifies the number of packets sent by the local computer, and
    the packet output unit outputs the packet when the PSN included in the packet falls within one of the PSN ranges defined for each RDMA transfer.
  6.  The relay device according to claim 5, wherein
    the packet is a non-first packet, that is, a packet that contains a segment of the data and is not the first packet, and
    the transfer destination selection unit refers to a table indicating the correspondence between each of the PSN ranges and the remote computer serving as the transfer destination and, when the PSN included in the non-first packet falls within one of the PSN ranges, selects the remote computer corresponding to that range as the transfer destination computer.
  7.  The relay device according to claim 1, wherein
    each of the plurality of pieces of connection information is issued per QP (Queue Pair) or per memory area to which data is transferred,
    the transfer destination selection unit, when selecting the transfer destination computer, also selects the QP or memory area to which the data is transferred, and
    the packet output unit includes in the packet the connection information that the transfer destination computer issued for the QP or memory area selected by the transfer destination selection unit.
  8.  The relay device according to claim 1, wherein
    the packet output unit, when the transmission path of the packet to the transfer destination computer is faulty, rewrites the header of the packet so that the packet is sent to the transfer destination computer over another transmission path.
PCT/JP2023/012870 2023-03-29 2023-03-29 Relay device WO2024201804A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2023/012870 WO2024201804A1 (en) 2023-03-29 2023-03-29 Relay device

Publications (1)

Publication Number Publication Date
WO2024201804A1 true WO2024201804A1 (en) 2024-10-03

Family

ID=92903626

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/012870 WO2024201804A1 (en) 2023-03-29 2023-03-29 Relay device

Country Status (1)

Country Link
WO (1) WO2024201804A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011058639A1 (en) * 2009-11-12 2011-05-19 富士通株式会社 Communication method, information processing device, and program
WO2022259452A1 (en) * 2021-06-10 2022-12-15 日本電信電話株式会社 Intermediate device, communication method, and program
