1 - Information
============================================

Enhancements were released to improve the reliability and availability of
the NVIDIA Tesla GV100 GPU and the BCM5719 NIC for the POWER9 AC922 system.
The drivers for these components must be updated as part of updating to
OP910.21, using the steps enumerated in this guide.

2 - Systems affected
============================================

IBM Power System AC922 (8335-GTG) with the following characteristics:

* NVIDIA Tesla V100 GPU(s)
* I/O Adapter: Dual port BCM5719 with shared port with BMC (NCSI)
  Adapter FW: 5719-v1.43 NCSI v1.3.12.0

And

* System's FW level: OP910.00, OP910.10, or OP910.20

3 - General Information
============================================

This Readme file gives directions on how to:

- Update the OS to Red Hat Enterprise Linux 7.5 LE for POWER9 with Z-stream 1;
- Update the NVIDIA Tesla CUDA driver to 396.26 from the CUDA 9.2 toolkit;
- Fix the BCM5719's NVRAM checksum;
- Update the firmware to OP910.21; and
- Update the BCM5719's NCSI image.

4 - System requirements
============================================

a) Operating system:
   - Red Hat Enterprise Linux 7.5 LE for POWER9 (RHEL 7.5-ALT) with the
     first Z-stream update. If the current OS level is Red Hat Enterprise
     Linux 7.4 LE for POWER9, it needs to be upgraded to RHEL 7.5-ALT with
     the process described herein.

b) Tools:
   - ethtool
   - python

c) Download the following files from Fix Central for AC922 (8335-GTG) for
   OP910.21:
   1) fix_bcm_5719_crc.py script
   2) python3_fix_bcm_5719_crc.py script (which is a python3 version)
   3) lnxfwupg.zip - Broadcom driver update
   4) nx1_ncsi_v1.4.22_PointDrop.zip - Broadcom NCSI driver image

5 - Update Linux OS
============================================

The Linux OS must be updated if it is not at the minimum required level of
Red Hat Enterprise Linux 7.5 LE for POWER9 with Z-stream 1. RHEL 7.4 ALT is
supported for the earlier AC922 service packs, but it is not supported for
the OP910.21 release level.

The OS must be updated to RHEL 7.5 ALT with the first Z-stream package
(https://access.redhat.com/errata/RHSA-2018:1374, which provides
"kernel-alt-4.14.0-49.2.2.el7a.src.rpm") as the minimum level.

- Linux kernel upstream: v4.16 and forward will be compatible as newer
  Z-streams are released.

To verify your current kernel, run:

    $ uname -r

For details, see the Known Issues section.

6 - Update the NVIDIA Tesla CUDA driver to 396.26
================================================

The NVIDIA Tesla CUDA driver must be updated to 396.26 from the CUDA 9.2
toolkit. The Tesla CUDA driver can be obtained from the NVIDIA
"http://www.nvidia.com/Download" site using the following information:

Advanced Driver Search:
    Product Type:      Tesla
    Product Series:    V-Series
    Product:           Tesla V100
    Operating System:  Linux POWER LE RHEL 7
    CUDA Toolkit:      9.2
    Language:          English(US)
    Recommended/Beta:  All

For information on installing the NVIDIA driver (including pre- and
post-install actions on POWER systems), refer to the Linux Installation
Guide at:
http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

7 - Verify current NCSI level
============================================

Before proceeding to the next steps, verify the current Broadcom adapter
firmware and NCSI image levels by running the following command:

    # for i in `lspci -d "14e4:1657:0200" | cut -f1 -d" " `; do ethtool -i $(echo $(ls /sys/bus/pci/devices/$i/net)) | grep firmware-version ; done

Example:

    # for i in `lspci -d "14e4:1657:0200" | cut -f1 -d" " `; do ethtool -i $(echo $(ls /sys/bus/pci/devices/$i/net)) | grep firmware-version ; done
    firmware-version: 5719-v1.43 NCSI v1.3.12.0
    firmware-version: 5719-v1.43 NCSI v1.3.12.0

8 - Fix Broadcom Adapter CRC
============================================

The Broadcom adapter's CRC checksum must be fixed before the NCSI image is
updated.
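
The checksum test that this section is concerned with can be illustrated
offline. The sketch below is not the contents of fix_bcm_5719_crc.py. It
assumes that the checksum is the standard CRC-32 (as computed by Python's
zlib.crc32) over NVRAM bytes 0x74 up to, but not including, 0xFC, and that
the stored value occupies the four bytes at 0xFC in little-endian order.
Both assumptions, and all function names, are illustrative and should be
checked against the kernel source referenced in [1]. The sketch works on a
synthetic buffer and does not touch any device.

```python
import struct
import zlib

CRC_START = 0x74   # first byte covered by the checksum (range given in this README)
CRC_OFFSET = 0xFC  # ASSUMPTION: the stored checksum lives here, little-endian


def expected_crc(nvram: bytes) -> int:
    """CRC-32 of the covered region (ASSUMPTION: standard zlib CRC-32)."""
    return zlib.crc32(nvram[CRC_START:CRC_OFFSET]) & 0xFFFFFFFF


def stored_crc(nvram: bytes) -> int:
    """Checksum currently recorded in the image."""
    return struct.unpack_from("<I", nvram, CRC_OFFSET)[0]


def crc_matches(nvram: bytes) -> bool:
    """True when the recorded checksum agrees with the recomputed one."""
    return expected_crc(nvram) == stored_crc(nvram)


def with_fixed_crc(nvram: bytes) -> bytes:
    """Return a copy of the image with the checksum field rewritten."""
    image = bytearray(nvram)
    struct.pack_into("<I", image, CRC_OFFSET, expected_crc(nvram))
    return bytes(image)


if __name__ == "__main__":
    # Synthetic 256-byte image, purely for demonstration.
    image = with_fixed_crc(bytes(range(256)))
    print("CRC matches:", crc_matches(image))
```

The real script additionally confirms the machine type and runs the ethtool
self-test on each interface, as described in this section; none of that is
modeled here.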

The fix_bcm_5719_crc.py Python script calculates the current CRC checksum
and applies it to the device if necessary. The script, which is based
entirely on the kernel's code [1], calculates the CRC32 checksum of the
NVRAM region 0x74-0xFC.

To use the script, run it as root. It verifies that the system is an AC922,
then examines every BCM5719 interface on it, performing an ethtool
self-test on each. If the NVRAM test fails, the script fixes the CRC:

    # python fix_bcm_5719_crc.py

If your system needs the fix, you will see output like:

    Analyzing device enP5p1s0f0.
    NVRAM test did't pass, checking current CRC and calculating correct value...
    enP5p1s0f0 Applying correct crc...
    Retrying NVRAM test...
    NVRAM test passed.

A system or interface that does not need the fix prints the following
message:

    Analyzing device enP5p1s0f1.
    CRC matches!

9 - Update system's firmware
============================================

Update the system's firmware to OP910.21 from Fix Central.

10 - Update the Broadcom NCSI image
============================================

To apply the new NCSI image, run the commands below.

Unzip lnxfwupg.zip:

    $ unzip lnxfwupg.zip
    Archive:  lnxfwupg.zip
      inflating: lnxfwupg/bmapilnx-17.0.6.sdk.tgz
      inflating: lnxfwupg/lnxfwupg-ppc64le.sdk.tgz

Now extract both tgz files:

    $ cd lnxfwupg/
    $ tar xzf lnxfwupg-ppc64le.sdk.tgz
    $ tar xzf bmapilnx-17.0.6.sdk.tgz

Extract the image into the same directory:

    $ unzip nx1_ncsi_v1.4.22_PointDrop.zip

Create a symlink for libbmapi:

    $ ln -s libbmapi_x64.so.6-17.0.6 libbmapi_x64.so.6

Select the first BCM5719 interface and apply the image:

    $ ETH=$(for i in `lspci -d "14e4:1657:0200" | head -1 | cut -f1 -d" " `; do echo `ls /sys/bus/pci/devices/$i/net`; done)
    $ ./lnxfwupg $ETH upgrade -mgmt nx1ncsi1.422

Then restart the system. Reloading the tg3 driver alone appears to be
sufficient, but a reboot is the more reliable option.

11 - Verify updated NCSI level
============================================

    # for i in `lspci -d "14e4:1657:0200" | cut -f1 -d" " `; do ethtool -i $(echo $(ls /sys/bus/pci/devices/$i/net)) | grep firmware-version ; done

Example:

    # for i in `lspci -d "14e4:1657:0200" | cut -f1 -d" " `; do ethtool -i $(echo $(ls /sys/bus/pci/devices/$i/net)) | grep firmware-version ; done
    firmware-version: 5719-v1.43 NCSI v1.4.22.0
    firmware-version: 5719-v1.43 NCSI v1.4.22.0

12 - Known Issues
============================================

After updating the BCM5719's NCSI image, DO NOT move back to Red Hat
Enterprise Linux 7.4-ALT LE for POWER9. There are incompatibilities between
old versions of the tg3 driver and the new NCSI image's heartbeat mechanism
that may compromise system connectivity.

13 - References
============================================

[1] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/broadcom/tg3.c?h=v4.13#n9681
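
The before-and-after checks in sections 7 and 11 rely on reading the
"firmware-version" lines by eye. As an optional aid, the comparison can be
sketched in Python. This is a minimal sketch, not part of the Fix Central
package: the function names are hypothetical, and it assumes the lines have
the same shape as the examples shown in this README (adapter firmware,
then "NCSI", then a version prefixed with "v").

```python
import re

# Matches lines such as: "firmware-version: 5719-v1.43 NCSI v1.4.22.0"
# ASSUMPTION: the line layout follows the examples in this README.
FW_LINE = re.compile(
    r"^firmware-version:\s*(?P<fw>\S+)\s+NCSI\s+v(?P<ncsi>[\d.]+)\s*$"
)


def parse_firmware_versions(text):
    """Return one (adapter_fw, ncsi_version) tuple per firmware-version line."""
    results = []
    for line in text.splitlines():
        match = FW_LINE.match(line.strip())
        if match:
            results.append((match.group("fw"), match.group("ncsi")))
    return results


def all_ports_at(text, wanted_ncsi):
    """True only if at least one port was found and every port reports wanted_ncsi."""
    versions = parse_firmware_versions(text)
    return bool(versions) and all(ncsi == wanted_ncsi for _, ncsi in versions)


if __name__ == "__main__":
    # Output copied from the example in section 11.
    sample = (
        "firmware-version: 5719-v1.43 NCSI v1.4.22.0\n"
        "firmware-version: 5719-v1.43 NCSI v1.4.22.0\n"
    )
    print(all_ports_at(sample, "1.4.22.0"))
```

To use it on a live system, the output of the command from section 11 would
be captured to a file or pipe and passed to all_ports_at() as a string.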