NVM Express over PCIe Gen4, Baby: The U.2 NoLoad™ and BlueField Edition!
- Written by: Stephen Bates
In a previous blog I was very pleased to announce that Eideticom, in partnership with Xilinx, IBM and Rackspace, performed the first public demo of NVM Express at PCIe Gen4. In this blog I’d like to give an update on that work, this time on a Mellanox ARM64 platform, and tie it into the bring-up of our U.2 NoLoad™!
Bringing up the U.2 NoLoad™
We publicly announced the U.2 version of NoLoad™ at Open Compute Summit back in March 2018. Since then we have received our first working samples and have been working to bring them up in our lab. You can see how excited I am by these samples in Figure 1.
Figure 1: The author looking very excited by his U.2 NoLoad™ sample!
One of the features of the U.2 NoLoad™ is PCIe Gen4 capability. Since U.2 only supports four lanes of PCIe, we’d like those four lanes to be as fast as they can be. To test this we needed a PCIe Gen4 capable CPU (or switch) and a server with U.2 drive bays. Cue BlueWhale!
Hunting the (Blue)Whale
BlueWhale is a storage server designed by Mellanox and based on their new BlueField ARM64 SoC. BlueField has some pretty neat features including:
- A Linux-capable ARM64 complex with MMU and IOMMU.
- DDR4 channels for memory access.
- Two integrated 100GbE RDMA-capable NICs.
- Many lanes of PCIe Gen4.
You can learn more about BlueField here or via your local Mellanox rep.
Figure 2: BlueField with NoLoad™ 250-U.2 and SSDs.
BlueWhale is a 2U server that is designed with U.2 NVMe devices in mind. It connects a BlueField SoC to an NVMe backplane that accepts up to 16 U.2 NVMe endpoints. You can learn more about BlueWhale here, and Mellanox were cool enough to lend us one (see Figure 3).
Figure 3: The BlueWhale 2U server. The 32 lanes of the BlueField SoC connect to two x16 slots that are cable connected to the backplane. Each slot can connect to 8 U.2 NVMe SSDs.
Show me the Money (the ARM64 Edition)!
As always, the proof is in the pudding, so we did some fio testing of our U.2 NoLoad™ running NVMe inside BlueWhale over PCIe Gen4. Here are some of the things we noted.
- The BlueWhale server is running standard Linux, in this case Ubuntu 18.04, so you get a look and feel that is identical to any Linux-based server, and all your favourite tools, libraries and packages are available via things like apt install.
- You can install and run all your favourite NVMe related tools just like any Linux box. In our example we install nvme-cli and show how we can use that to identify the NVMe controllers and namespaces installed on the system.
- We can install and run fio on BlueWhale in the exact same way you can on any Linux box. Same scripts, same source code, just twice the PCIe/NVMe goodness ;-). In our demo we show about 5 GB/s of NVMe read bandwidth over 4 lanes of PCIe, a result that is impossible at PCIe Gen3 where about 3.5 GB/s is the upper limit. (A rough sketch of the commands we used follows this list.)
- All the standard PCIe tools (like lspci) work as expected. We will dig into that point in a bit more detail below.
- All the awesome work being done by the Linux kernel development team around NVMe and the block layer comes for free thanks to the kernel deployed on the BlueWhale.
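To make that a bit more concrete, here is a minimal sketch of the kind of commands we ran on BlueWhale. The device path (/dev/nvme0n1) and the fio job parameters are illustrative only; your namespace numbering and queue settings will differ.

```
# Pull the standard NVMe and benchmarking tools from the Ubuntu 18.04 repos.
sudo apt update
sudo apt install -y nvme-cli fio

# List the NVMe controllers and namespaces the inbox kernel driver has bound.
sudo nvme list

# Sequential read bandwidth against one NoLoad namespace.
# Device path and job parameters are illustrative, not the exact job we ran.
sudo fio --name=noload-seq-read --filename=/dev/nvme0n1 \
  --rw=read --bs=128k --iodepth=32 --numjobs=4 \
  --ioengine=libaio --direct=1 --time_based --runtime=30 \
  --group_reporting
```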
I did want to capture a screenshot from our testing and highlight some things in it (see Figure 4).
Figure 4: lspci -vvv for the U.2 Eideticom NoLoad™ inside the BlueWhale server. Some of the more interesting pieces of the output are highlighted and labelled.
Let’s go through the interesting parts of the lspci -vvv output in Figure 4.
- A- Eideticom has registered its vendor ID and device IDs with the PCI ID database. This means you get a human-readable description of the NoLoad™ in your system. Note this holds for any system, not just ARM64, or even Linux for that matter.
- B- The NoLoad™ has three PCIe BARs. BAR0 is 16KB and is the standard NVMe BAR that any legitimate NVMe device must have. The NVMe driver maps this BAR and uses it to control the NVMe device.
- C- The third BAR of NoLoad™ is unique in that it is a Controller Memory Buffer (CMB) which can be used for both NVMe queues and NVMe data. To our knowledge no other device yet supports both queues and data in its CMB. Also note our CMB is pretty big (512MB). We talked more about what we do with this CMB in a previous blog post.
- D- Thanks to our friends at Xilinx we can advertise PCIe Gen4 capability (i.e. 16GT/s). Note in systems that do not support Gen4 we will simply come up at PCIe Gen3.
- E- Thanks to our friends at Mellanox the CPU and server system support PCIe Gen4, so the link is up and running at Gen4 (16GT/s)! The U.2 NoLoad™ is x4, so our maximum throughput will be about 7 GB/s. (A sketch of how to check the negotiated link from the command line follows this list.)
- F- Since our device is an NVMe device it is bound to the standard Linux kernel NVMe driver. This means we get all the goodness and performance of a stable, inbox driver that ships in all major OSes and all Linux distributions. No need to compile and insmod/modprobe a crappy proprietary driver, thank you very much!
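If you want to verify points B, C and E on your own system, something along these lines works on any Linux box. The grep for "eideticom" and the field names are just how we happen to find the device and pull out the BAR and link information; adjust the match string for your own hardware.

```
# Find the NoLoad's PCIe address; the human-readable Eideticom string shows
# up because the IDs are registered in the PCI ID database (point A).
BDF=$(lspci -D | grep -i eideticom | head -n 1 | awk '{print $1}')

# Dump the BARs (Regions) and the link capability/status fields. On BlueWhale
# the LnkSta line reports Speed 16GT/s (Gen4) at Width x4.
sudo lspci -vvv -s "$BDF" | grep -E 'Region|LnkCap:|LnkSta:'
```

For reference, the ~7 GB/s figure in point E is just the raw line rate: 16 GT/s × 4 lanes × 128b/130b encoding ≈ 7.9 GB/s, which PCIe and NVMe protocol overhead brings down to roughly 7 GB/s of usable read bandwidth.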
Where Next?
Eideticom has now demonstrated NVM Express at PCIe Gen4 speeds on both Power9 and ARM64 based platforms. As other CPUs reach the market with PCIe Gen4 we will be ready to test those too. Similarly as PCIe Gen4 switches become available we will test them also. The additional bandwidth provided by Gen4 is very useful to us.
We have demonstrated NVM Express at PCIe Gen4 speeds in both Add-In Card (AIC) and U.2 form-factor NoLoad™s. We can give our customers form-factor choice whilst satisfying their bandwidth requirements.
We are working on a range of acceleration services for storage and analytics that can either be provided inside Fabric-attached JBOFs (or FBOFs, as I like to call them) or deployed as disaggregated accelerators shared out over NVMe-oF networks.
Huge thanks to the folks at Mellanox who worked with us in getting these results. We will be doing a lot more testing on this platform in the coming months. Stay tuned for more.