SNIA-SDC19-FC-NVMe Deep Dive FC-NVMe-Advanced-Final6 FCIA 2020
DNS
– Service deployment
– Client/Server relationships are pre-defined
FC Basics and Terminology
• Fibre Channel was traditionally defined with 3 classes of
service
– Class 1 – Worked like a telephone crossbar switch
• This Class has been deprecated and is no longer used
– Class 2 – Acknowledged datagram service
– Class 3 – Unacknowledged datagram service
• Class 3 is the service used by FC-NVMe
FC Basics and Terminology (cont.)
• Each unit of transmission is called a “Frame”
– A frame can be up to 2112 bytes
• Multiple Frames can be bundled into a “Sequence”
– A Sequence can be used to transfer large amounts of data – possibly up to multi-megabytes (instead of 2112 bytes for a single frame)
• An interaction between two Fibre Channel ports is termed an “Exchange”
– An Exchange consists of a “Request” Sequence
– And, a “Reply” Sequence
– Many protocols (including SCSI and FC-NVMe) use an Exchange as a single
command/response
– Individual frames within the same Exchange are guaranteed to be delivered in order
– Each Exchange may be sent on a different path through the fabric
• Different exchanges have no order guarantee
• Allows Switches to pick most efficient route in Fabric
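The frame/Sequence/Exchange hierarchy above can be sketched in code. This is an illustrative model, not real HBA code: OX_ID, SEQ_ID, and SEQ_CNT are actual FC frame header fields, but the reassembly logic here is a simplified assumption.

```python
# Illustrative sketch of FC frame grouping (not from the slides).
from dataclasses import dataclass

@dataclass
class FcFrame:
    ox_id: int      # Originator Exchange ID: identifies the Exchange
    seq_id: int     # Sequence ID within the Exchange
    seq_cnt: int    # Frame position within the Sequence
    payload: bytes  # Up to 2112 bytes in a real FC frame

def reassemble_sequence(frames):
    """Frames within one Exchange are delivered in order, so a Sequence
    can be rebuilt by concatenating payloads in SEQ_CNT order."""
    ordered = sorted(frames, key=lambda f: f.seq_cnt)
    return b"".join(f.payload for f in ordered)

# A transfer split across three frames of one Sequence in Exchange 0x10:
frames = [FcFrame(0x10, 1, i, bytes([i]) * 2048) for i in range(3)]
data = reassemble_sequence(frames)
assert len(data) == 3 * 2048
```

Because in-order delivery is only guaranteed within an Exchange, a switch is free to route different Exchanges over different fabric paths, as the slide notes.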
FC Basics and Terminology (cont.)
• FCP (Fibre Channel Protocol) Data Transfer
– FCP is the Upper Layer Protocol originally specified to transport
SCSI over Fibre Channel
– The FCP Data Transfer protocol has since been adapted to
transport FC-SB (FICON) and now FC-NVMe
– Provides a high speed low latency transport
NVMe Background
Three NVMe phases:
1. Basic (PCIe-based) NVMe in servers (last few years) – NVMe SSD modules can plug directly into servers, or into PCIe expansion slots behind a PCIe expander, across a rack of servers.
2. Basic NVMe in the storage backend (in production!) – servers connect over FC (carrying SCSI) to a flash array whose backend uses SATA, SAS, or PCIe NVMe devices.
3. NVMe over Fabrics (demoing now) – servers connect over FC carrying NVMe end-to-end to a flash array with a PCIe NVMe backend.
NVMe Basics and Terminology
• NVMe-oF – Shorthand for NVMe over Fabrics
• Host – The system which issues I/O commands to a Subsystem
• Subsystem – A non-volatile memory storage device
• Capsule – A unit of information exchange used in NVMe over
Fabrics. Contains NVMe commands and responses.
• Discovery Controller – The controller which contains NVMe-oF
specific discovery information
NVMe Basics and Terminology (cont.)
• SQ (Submission Queue) – The queue used to submit I/O commands
to the controller
• CQ (Completion Queue) – The queue used by the controller to indicate command completions, carrying completion status and any returned data
• Admin Queue – The queue used to submit Admin commands to the
controller
• SQE (Submission Queue Entry) – A submission to the Submission
or Admin Queue – Contains the command to be performed
• CQE (Completion Queue Entry) – A submission to the Completion
Queue containing any returned data and completion status
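The queue-pair terminology above can be illustrated with a minimal sketch. This is an assumption-laden model for exposition only: real SQEs are 64 bytes and CQEs are 16 bytes, and a real controller consumes queues via doorbells, not method calls.

```python
# Minimal sketch of a paired Submission Queue / Completion Queue.
from collections import deque

class QueuePair:
    def __init__(self):
        self.sq = deque()  # Submission Queue: host -> controller
        self.cq = deque()  # Completion Queue: controller -> host

    def submit(self, opcode, command_id):
        # The SQE carries the command to be performed.
        self.sq.append({"opcode": opcode, "cid": command_id})

    def controller_process(self):
        # The controller consumes SQEs and posts one CQE per command,
        # echoing the command identifier with a status code.
        while self.sq:
            sqe = self.sq.popleft()
            self.cq.append({"cid": sqe["cid"], "status": 0})  # success

qp = QueuePair()
qp.submit(opcode=0x02, command_id=7)  # 0x02 is the NVMe Read opcode
qp.controller_process()
cqe = qp.cq.popleft()
assert cqe == {"cid": 7, "status": 0}
```

The Admin Queue described above works the same way but carries Admin commands instead of I/O commands.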
FC-NVMe
What to take away from this section?
• Most important part
• Deep Dive into how NVMe over Fabrics is
mapped onto Fibre Channel
• In-depth look at how discovery is done in FC-NVMe
• Next Section
• FC-NVMe use cases in the data center
FC-NVMe
• Goals
• Comply with NVMe over Fabrics Spec
• High performance/low latency
• Use existing HBA and switch hardware
• Don’t want to require new ASICs to be spun to support
FC-NVMe
• Fit into the existing FC infrastructure as much as possible,
with very little real-time software management
• Pass NVMe SQE and CQE entries with little or no interaction from the FC layer
• Maintain Fibre Channel metaphor for transportability
• Name Server
• Zoning
• Management
Performance
• The Goal of High Performance/Low Latency
• Means that FC-NVMe needs to use an
existing hardware accelerated data transfer
protocol
• FC does not have an RDMA protocol so FC-NVMe uses FCP for the data transfer protocol
• Currently both SCSI and FC-SB (FICON)
use FCP for data transfers
• FCP is deployed as hardware accelerated
in most (if not all) HBAs
FCP Data Transfer Mapping
• NVMe-oF capsules (i.e., commands and responses) are directly mapped into FCP Information Units (IUs)
1. NVMe Submission Queue Entry (SQE) to the FC-NVMe CMND IU
2. Data to FCP Data IU(s)
3. NVMe Completion Queue Entry (CQE) to the FCP Response IU
I/O Operation
• Transactions for a particular I/O Operation are bundled into an
FC Exchange
– Example Exchange (Read I/O Operation): Read Command, then Data, then Response
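The IU bundling described above can be sketched as follows. IU names come from the slides; the function and byte layouts are illustrative assumptions, not spec-accurate encodings.

```python
# Sketch: one Read I/O Operation modeled as the ordered IUs of an FC Exchange.
def read_io_exchange(sqe, data, cqe):
    """The SQE rides in the CMND IU, data in FCP Data IU(s),
    and the CQE in the FCP Response IU."""
    ius = [("CMND_IU", sqe)]                       # request Sequence
    ius += [("DATA_IU", chunk) for chunk in data]  # data from the target
    ius.append(("RSP_IU", cqe))                    # reply Sequence
    return ius

exchange = read_io_exchange(sqe=b"\x02" + b"\x00" * 63,       # 64-byte SQE
                            data=[b"a" * 2048, b"b" * 2048],  # two data IUs
                            cqe=b"\x00" * 16)                 # 16-byte CQE
assert [kind for kind, _ in exchange] == ["CMND_IU", "DATA_IU", "DATA_IU", "RSP_IU"]
```

Bundling the whole operation into one Exchange is what lets the fabric keep these IUs in order while still routing other Exchanges independently.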
FC-NVMe CMND IU Fields
• Format ID (FDh) and FC ID (28h) – uniquely identify an FC-NVMe (vs. SCSI or FC-SB) CMND IU
• CMND IU Length – the length of the IU
• Flags – indicates Read/Write
• Data Length – the total length of the data to be read or written
• NVMe SQE (64 bytes) – carried in words 6–21 of the CMND IU
• Transferred Data Length – indicates the amount of data transferred, reported back along with the NVMe CQE (16 bytes) in the Response IU
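The fixed identification bytes above can be shown in a small packing example. The Format ID and FC ID values come from the slide; the field widths and big-endian layout are assumptions for illustration, not a spec-verified encoding.

```python
# Illustrative packing of the start of an FC-NVMe CMND IU.
import struct

def pack_cmnd_iu_header(iu_length):
    # Format ID (1 byte, FDh), FC ID (1 byte, 28h),
    # then a CMND IU Length field (width assumed here to be 2 bytes).
    return struct.pack(">BBH", 0xFD, 0x28, iu_length)

hdr = pack_cmnd_iu_header(iu_length=96)
assert hdr[0] == 0xFD and hdr[1] == 0x28  # identifies FC-NVMe vs. SCSI/FC-SB
```

A receiver can thus distinguish an FC-NVMe command from a SCSI or FC-SB (FICON) one by inspecting the first two bytes of the IU.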
SGL read example
• The NVMe SQE command contains an SGL pointing to the local destination data buffers (memory regions #1 through #n)
• The SGL list is converted to a single-entry SGL for placement into the CMD_IU
– Points to an offset into the response DATA_IU
• The original SGL is saved for placing the response DATA_IU data
• Data from the original SGL is merged into a single DATA_IU payload
SGL read example (cont.)
• In this example, the local host HBA retains the SGL sent with the original command for direct data placement
• Data is sent as a single contiguous region from the Host/Subsystem; the HBA scatters it into memory regions #1 through #n
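The gather/scatter behavior above can be sketched in a few lines. This is a toy model of the idea, not HBA firmware: an SGL is represented here simply as (address, length) pairs over a flat buffer.

```python
# Sketch of SGL handling: a multi-entry SGL collapses to one contiguous
# DATA_IU payload, and the saved original SGL scatters it back on receipt.
def gather(sgl, memory):
    """Merge the regions an SGL points at into one contiguous payload."""
    return b"".join(memory[addr:addr + length] for addr, length in sgl)

def scatter(payload, sgl, memory):
    """Direct data placement: split the payload back per the saved SGL."""
    offset = 0
    for addr, length in sgl:
        memory[addr:addr + length] = payload[offset:offset + length]
        offset += length

mem = bytearray(32)
sgl = [(0, 4), (16, 4)]          # two non-contiguous 4-byte regions
scatter(b"ABCDEFGH", sgl, mem)   # place an 8-byte DATA_IU payload
assert gather(sgl, bytes(mem)) == b"ABCDEFGH"
```

Keeping the original SGL on the host side is what allows the data to travel as one contiguous DATA_IU yet still land in scattered user buffers with no extra copy.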
Zero Copy
• Zero-copy
• Allows data to be sent to the user application with minimal copies
• RDMA is a semantic which encourages more efficient data
handling, but you don’t need it to get efficiency
• FC has had zero-copy years before there was RDMA
• Data is DMA’d straight from HBA to buffers passed to user
• Difference between RDMA and FC is the APIs
• RDMA does a lot more to enforce a zero-copy mechanism, but it
is not required to use RDMA to get zero-copy
FCP Transactions
• FCP Transactions look similar to the RDMA model
– For Read: FCP_CMND from the Initiator, then FCP_DATA from the Target, then FCP_RSP
– For Write: FCP_CMND from the Initiator, FCP_XFER_RDY from the Target, FCP_DATA to the Target, then FCP_RSP
NVMe-oF Protocol Transactions
• NVMe-oF over RDMA protocol transactions
– For Read: NVMe-oF_CMD from the Initiator, RDMA Write of the data from the Target, then NVMe-oF_RSP
– For Write: NVMe-oF_CMD from the Initiator, RDMA Read from the Target with an RDMA Read Response carrying the data, then NVMe-oF_RSP
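The similarity between the two transaction models can be made concrete with a side-by-side sketch. IU and verb names come from the slides; the functions themselves are illustrative only.

```python
# Side-by-side sketch of the wire-level flows for FCP and NVMe-oF/RDMA.
def fcp_flow(is_write):
    if is_write:
        return ["FCP_CMND", "FCP_XFER_RDY", "FCP_DATA", "FCP_RSP"]
    return ["FCP_CMND", "FCP_DATA", "FCP_RSP"]

def nvmeof_rdma_flow(is_write):
    if is_write:
        return ["NVMe-oF_CMD", "RDMA Read", "RDMA Read Resp", "NVMe-oF_RSP"]
    return ["NVMe-oF_CMD", "RDMA Write", "NVMe-oF_RSP"]

# Both transports use a command/data/response round trip for reads and add
# one "fetch the data" step for writes (Transfer Ready vs. RDMA Read).
assert len(fcp_flow(True)) == len(nvmeof_rdma_flow(True)) == 4
```

This structural match is why FCP could serve as the FC-NVMe data transfer protocol without new ASICs: existing hardware acceleration already implements the same shape of transaction.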
Discovery
• Two main types of FC topologies
– Point-to-point
• No FC switches/Fabric
• No FC Name Server
• No Zoning or other Fabric services
• Typically smaller configurations
– Fabric based
• 1 or more FC switches
• Discovery based around the FC
Name Server
• Zoning in effect
• Can be very large configurations
(thousands of ports)
Point-to-point discovery
• No FC Name Server
– Configurations are usually either
self discovered or statically
configured
– NVMe Discovery Controller may
be used for FC-NVMe
• Note: Less common topology
Fabric Discovery
• FC-NVMe Fabric Discovery uses both
• FC Name Server to identify FC-NVMe ports
• NVMe Discovery Service to disclose NVMe Subsystem
information for those ports
• This dual approach allows each component to manage the area
it knows about
• FC Name Server knows all the ports on the fabric and the type(s)
of protocols they support
• NVMe Discovery Service knows all the particulars about NVMe
Subsystems
FC-NVMe Discovery Example
1. The FC-NVMe Initiator connects to the FC Name Server
2. The FC Name Server points to the NVMe Discovery Controller(s)
3. The FC-NVMe Initiator connects to the NVMe Discovery Controller(s)
4. The NVMe Discovery Controller(s) identify the available NVMe Subsystems
5. The FC-NVMe Initiator connects to the NVMe Subsystem(s) to begin data transfers
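The two-step discovery flow above can be sketched as pseudocode-style Python. All class and method names here are illustrative assumptions, not APIs from any FC library; the NQN shown is a made-up example.

```python
# Sketch of fabric discovery: Name Server first, then the NVMe Discovery
# Controller(s) it points to, yielding the subsystems to connect to.
def discover_subsystems(name_server, discovery_controllers):
    subsystems = []
    for port in name_server.query("FC-NVMe"):  # which ports speak FC-NVMe
        dc = discovery_controllers[port]       # connect to that port's DC
        subsystems += dc.log_page()            # DC lists its NVMe Subsystems
    return subsystems                          # initiator connects to these

class FakeNameServer:
    def query(self, protocol):
        return ["port_a"]          # one FC-NVMe-capable port on the fabric

class FakeDiscoveryController:
    def log_page(self):
        return ["nqn.2014-08.org.example:subsys1"]  # illustrative NQN

subs = discover_subsystems(FakeNameServer(),
                           {"port_a": FakeDiscoveryController()})
assert subs == ["nqn.2014-08.org.example:subsys1"]
```

The split matches the slide's point: the Name Server knows the fabric's ports and protocols, while the Discovery Controller knows the NVMe-specific details.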
Zoning and Management
• Of course, FC-NVMe also works with
– FC Zoning
– FC Management
Use Cases
Traditional latency sensitive storage apps
• As the latency of enterprise storage features declines, the (server side) latency savings of NVMe will motivate latency sensitive apps, like financial apps, to move to NVMe over FC
[Diagram: a Financial App reaching an NSID on an enterprise-featured NVMe Subsystem across an FC Fabric]
Traditional IOPS sensitive storage apps
• The enhanced queuing capabilities of NVMe enable much more parallelism, even for existing apps. As the number of server cores and threads grows, and the number of VMs explodes, there is increasing need to exploit the potential parallelism in solid state targets. NVMe is designed to enable that.
[Diagram: super parallel app(s) and many VMs driving many NSIDs on an NVMe Subsystem across an FC Fabric]
New latency sensitive “memory” apps
• The ultra-low latency of SSDs is corresponding with new apps, such as data mining or machine learning. These apps have voracious appetites for low latency persistent memory, beyond what fits in a server, and yet they often do not require the “enterprise features” of traditional enterprise storage. We might call these “memory-oriented” applications. They can leverage the low latency of Fibre Channel to access massive low latency NVMe arrays.
[Diagram: a latency-focused Data Mining App reaching an NSID on an NVMe Subsystem across an FC Fabric]
FC-NVMe / FC-SCSI dual protocol usage
• Database app maintains a high value database on a high SLA legacy array
• Data mining app requires a super low latency reference image of the DB
• Regularly snapshot the DB in the legacy array
• Use the data mining server to copy the snapshot to an ultra-low latency NSID
• Run the data mining application using the low latency NSID reference copy
[Diagram: DB App with the DB master and snapshot on an FC-SCSI Array; Data Mining App with the DB reference copy on an NVMe Subsystem; both on the same FC Fabric]
Wrapping it up
FC-NVMe
• Wicked Fast!
• Builds on 20 years of the most robust
storage network experience
• Can be run side-by-side with existing SCSI-
based Fibre Channel storage environments
• Inherits all the benefits of Discovery and
Name Services from Fibre Channel
• Capitalizes on trusted, end-to-end
Qualification and Interoperability matrices
in the industry
Milestone
• FC-NVMe completed 1st round
of approval within the T11.3
committee in August!
– Ratification of document as
technically stable
After this Webcast
• Please rate this event – we value your feedback
• We will post a Q&A blog at http://fibrechannel.org/ with
answers to all the great questions we received today
• Follow us on Twitter @FCIAnews
• Join us for our next live FCIA webcast:
Long-Distance Fibre Channel
October 10, 2017
10:00 am PT
Register at … https://www.brighttalk.com/webcast/14967/277327
Thank you!