IBM POWER9 Systems LC Server Firmware
Applies to: AC922 (8335-GTH) and AC922 (8335-GTX)
This document provides information about the installation of Licensed Machine or Licensed Internal Code, which is sometimes referred to generically as microcode or firmware.
This package provides firmware for the Power System AC922 (8335-GTH) and AC922 (8335-GTX) servers only.
The firmware level in this package is:
•OP920.20 / PNOR OP9_v2.0.14-2.6 and openBMC ibm-v2.3-476-r32
This section specifies the "Minimum ipmitool Code Level" required by the System Firmware for managing the system. OpenPOWER requires ipmitool level v1.8.15 or later to execute correctly on the OP910 and later firmware. It must be capable of establishing an IPMI v2 session with the IPMI support on the BMC.
Verify your ipmitool level on your Linux workstation using the following command:
bash-4.1$ ipmitool -V
ipmitool version 1.8.15
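The version check above can also be scripted. The following is a minimal sketch, assuming the output of "ipmitool -V" has the "ipmitool version X.Y.Z" form shown above; the function names are illustrative and not part of ipmitool.

```python
import re

# Minimum ipmitool level stated in this document.
MINIMUM_IPMITOOL = (1, 8, 15)


def parse_ipmitool_version(output):
    """Return the version printed by `ipmitool -V` as a tuple of ints."""
    match = re.search(r"ipmitool version (\d+(?:\.\d+)*)", output)
    if match is None:
        raise ValueError("unrecognized ipmitool -V output: %r" % output)
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(output, minimum=MINIMUM_IPMITOOL):
    """True when the reported version is at or above the minimum level."""
    return parse_ipmitool_version(output) >= minimum
```

On a workstation, the text passed to meets_minimum() could be captured with subprocess.run(["ipmitool", "-V"], capture_output=True, text=True).stdout.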
If you need to update or add ipmitool on your Linux workstation, you can compile ipmitool (current level 1.8.15) for Linux from SourceForge as follows:
1.1.1 Download the ipmitool tar file from http://sourceforge.net/projects/ipmitool/ to your Linux system
1.1.2 Extract tarball on Linux system
1.1.3 cd to top-level directory
1.1.4 ./configure
1.1.5 make
1.1.6 ipmitool will be under src/ipmitool
You may also get the ipmitool package directly from your workstation Linux packages.
For specific fix level information on key components of IBM Power Systems LC and Linux operating systems, please refer to the documentation in the IBM Knowledge Center for the AC922 (8335-GTH) and AC922 (8335-GTX):
https://www.ibm.com/support/knowledgecenter/en/POWER9/p9hdx/8335_gth_landing.htm
https://www.ibm.com/support/knowledgecenter/en/POWER9/p9hdx/8335_gtx_landing.htm
If using xCAT on the host OS to do firmware updates, the minimum xCAT level that should be used is 2.13.4 because it has stability improvements for the firmware update process. See the xCAT 2.13.4 release notes below for more information.
https://github.com/xcat2/xcat-core/wiki/XCAT_2.13.4_Release_Notes
The Linux OS must have the NVIDIA CUDA driver at minimum level 396.26 to be compatible with OP920.00 and later levels. Without this driver, a GPU which has faulted and gone through a GPU reset can cause a Terminate Immediate (TI) for the system. The recommended level for the NVIDIA CUDA driver is 396.44 or later to get ATS performance improvements.
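The minimum and recommended driver levels can be expressed as a small check. This is a sketch only: the classification labels are illustrative, and the driver version string is assumed to be obtained separately, for example from "nvidia-smi --query-gpu=driver_version --format=csv,noheader".

```python
# Driver levels stated in this document.
MINIMUM_DRIVER = (396, 26)
RECOMMENDED_DRIVER = (396, 44)


def classify_driver(version):
    """Classify a driver version string such as '396.44'."""
    level = tuple(int(part) for part in version.strip().split("."))
    if level < MINIMUM_DRIVER:
        return "below minimum"
    if level < RECOMMENDED_DRIVER:
        return "minimum"
    return "recommended"
```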
The Power AC922 server delivers four Tesla V100 with NVLink GPUs supported in two processor sockets.
The Tesla CUDA driver can be obtained from the following NVIDIA download link:
https://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/tesla/396.44/nvidia-driver-local-repo-rhel7-396.44-1.0-1.ppc64le.rpm&lang=us&type=Tesla
To search for the driver manually, go to http://www.nvidia.com/Download/index.aspx?lang=en-us and use the following information:
Manually find drivers for my NVIDIA products.
Product Type: Tesla
Product Series: V-Series
Product: Tesla V100
Operating System: Linux POWER LE RHEL 7
CUDA Toolkit: 9.2
Language: English (US)
Search results:
Version: 396.44
Release Date: 2018.8.6
Operating System: Linux POWER LE RHEL 7
CUDA Toolkit: 9.2
Language: English (US)
File Size: 47.28 MB
On Red Hat Enterprise Linux (RHEL) for PPC, RHEL-Alt 7.5, the Trusted Platform Module (TPM) device driver is not loaded automatically at boot time. Without this driver, the TPM device will not be accessible.
This affects any user-space application needing to access the TPM, as well as kernel security functions, such as the Integrity Measurement Architecture subsystem (IMA) in the Linux kernel. Without the TPM driver loaded, IMA will be unable to record trusted measurements to the TPM.
To load the driver manually, as root:
# modprobe tpm_i2c_nuvoton
To load the driver automatically at boot time:
# echo "tpm_i2c_nuvoton" > /etc/modules-load.d/tpm.conf
The TPM device driver will be integrated as a built-in kernel module in a future release of RHEL-Alt 7. Once this is done, it will be loaded automatically and this procedure will no longer be necessary.
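To confirm that the procedure above took effect, the loaded-module list and the boot-time configuration file can be inspected. The sketch below assumes the standard /proc/modules layout (module name as the first field) and the modules-load.d format (one module name per line, with "#" or ";" starting a comment); the function names are illustrative.

```python
TPM_MODULE = "tpm_i2c_nuvoton"  # driver name taken from this document


def module_is_loaded(proc_modules_text, name=TPM_MODULE):
    """True when `name` is the first field of a line of /proc/modules text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def listed_for_boot(conf_text, name=TPM_MODULE):
    """True when a modules-load.d file names the module on a non-comment line."""
    for line in conf_text.splitlines():
        entry = line.strip()
        if entry and not entry.startswith(("#", ";")) and entry == name:
            return True
    return False
```

For example, module_is_loaded(open("/proc/modules").read()) reports whether the driver is currently loaded as a module. A driver built into the kernel does not appear in /proc/modules.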
Downgrading firmware from any given release level to an earlier release level is not recommended.
If you feel that it is necessary to downgrade the firmware on your system to an earlier release level, please contact your next level of support.
Concurrent Firmware Updates not available for LC servers.
Concurrent system firmware update is not supported on LC servers.
The 8335-GTG model is not supported for the OP920.xx release. However, the 8335-GTG may be upgraded to an 8335-GTH model by an SSR. These steps involve replacing the hardware processor features in the 8335-GTG and then updating to the alternative PNOR and BMC images, which can be found in Fix Central as part of the initial OP920.00 delivery. At the successful conclusion of the upgrade steps, the system model will be 8335-GTH with the OP920.00 release firmware. You may then update to this latest level of firmware.
The existing processors being replaced during a model or feature conversion become the property of IBM and must be returned. Feature conversions are always implemented on a "quantity of one for quantity of one" basis. Multiple existing features may not be converted to a single new feature. Single existing features may not be converted to multiple new features.
Feature conversions for 8335-GTG to 8335-GTH for processor features:
From FC | To FC | Return Parts? |
EP0K - 16-core 2.60 GHz (3.09 GHz Turbo) POWER9 Processor | EP0P - 16-core 2.7 GHz (3.3 GHz Turbo) POWER9 Processor | Yes |
EP0M - 20-core 2.0 GHz (2.87 GHz Turbo) POWER9 Processor | EP0R - 20-core 2.4 GHz (3.0 GHz Turbo) POWER9 Processor | Yes |
•witherspoon-IBM-OP9-v2.0-2.14-cfm_prod.pnor.squashfs.tar - alternative PNOR image for use in the GTG to GTH upgrade option.
•obmc-witherspoon-ibm-v2.1-cfm.ubi.mtd.tar - alternative BMC image for use in the GTG to GTH upgrade option.
(Note that these files exist only under the OP920.00 delivery in Fix Central)
Use the following examples as a reference to determine whether your installation will be concurrent or disruptive.
For the LC server systems, the installation of system firmware is always disruptive.
The BMC and PNOR image tar files are used to update the primary side of the PNOR and the primary side of the BMC only, leaving the golden sides unchanged.
Filename | Size | Checksum |
obmc-witherspoon-ibm-v2.3.ubi.mtd.tar | 24023040 | 5ff23b60c6cad376ce823b0237bfed11 |
witherspoon-IBM-OP9-v2.0.14-2.6_prod.pnor.squashfs.tar | 23418880 | d73b8bb54b65ad4757bb5d6a1d59a5c0 |
Note: The Checksum can be found by running the Linux/Unix/AIX md5sum command against the Hardware Platform Management (hpm) file (all 32 characters of the checksum are listed), for example: md5sum <filename>
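The checksum comparison can also be done programmatically. The following is a minimal sketch using Python's hashlib; the file names and checksums are copied from the table above, and the function names are illustrative.

```python
import hashlib

# File names and MD5 checksums as listed in this document.
EXPECTED_MD5 = {
    "obmc-witherspoon-ibm-v2.3.ubi.mtd.tar": "5ff23b60c6cad376ce823b0237bfed11",
    "witherspoon-IBM-OP9-v2.0.14-2.6_prod.pnor.squashfs.tar": "d73b8bb54b65ad4757bb5d6a1d59a5c0",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """True when the file's MD5 digest equals the expected 32-character value."""
    return md5_of_file(path) == expected_md5.lower()
```

A downloaded image would be checked with matches_expected(path, EXPECTED_MD5[filename]); a mismatch suggests an incomplete or corrupted download.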
After a successful update to this firmware level, the PNOR components and BMC should be at the following levels.
To display the PNOR level, use the following BMC command: "cat /var/lib/phosphor-software-manager/pnor/ro/VERSION | grep -A 12 IBM ".
The grep is needed to remove a security string at the start of the VERSION output for easier viewing of the PNOR level.
And the BMC command line command "cat" can be used to display the BMC level: "cat /etc/os-release".
Note: FRU information for the PNOR level does not show the updated levels via the fru command until the system has been booted once at the updated level.
PNOR firmware level: driver content
Display the PNOR FW level using this command: "cat /var/lib/phosphor-software-manager/pnor/ro/VERSION"
IBM-witherspoon-ibm-OP9-v2.0.14-2.6-prod
op-build-v2.0.14-8-g91bf747
buildroot-2018.05.1-9-gc99f2ee
skiboot-v6.0.18
hostboot-c00d44a-pc8e37c7
occ-8fa3854
linux-4.17.12-openpower1-p88e50e5
petitboot-v1.7.5-p9a906c4
machine-xml-2d9c9f0
hostboot-binaries-hw031319a.op920
capp-ucode-p9-dd2-v4
sbe-b6ee17b
hcode-hw031319a.op920
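Checking the PNOR level after an update can be automated by scanning the VERSION text. The sketch below assumes that the level line starts with "IBM-", as in the listing above, and that any security string precedes it; the function names and the default expected level string are illustrative.

```python
def pnor_level(version_text):
    """Return the first line that starts with 'IBM-', or None if absent."""
    for line in version_text.splitlines():
        line = line.strip()
        if line.startswith("IBM-"):
            return line
    return None


def at_expected_level(version_text, expected="OP9-v2.0.14-2.6"):
    """True when the PNOR level line contains the expected level string."""
    level = pnor_level(version_text)
    return level is not None and expected in level
```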
openBMC level:
Display the BMC FW level via an ssh session on the BMC, using this command: root@witherspoon:~# cat /etc/os-release
ID: openbmc-phosphor
NAME: Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro)
VERSION: "ibm-v2.3"
VERSION_ID "ibm-v2.3-476-g2d622cb-r32-0-g9973ab0"
PRETTY_NAME "Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro) ibm-v2.3"
BUILD_ID "ibm-v2.3-476-g2d622cb-r32"
OP920
OP9_v2.0.14-2.6 / OP920.20
04/04/2019 | Impact: Data Severity: HIPER
New features and functions
Support was added for alternate fan settings to increase cooling for air-cooled system configurations that have optical cables. In cases of lower processor and GPU usage but high I/O activity over the optical cables, the system may have an over-temperature warning near the I/O adapter because the fans are running too slowly with the default setting. The alternate "custom" setting for the higher fan speeds can be enabled using the following openbmctool commands at version 1.13 and later (this version supports reading the current/supported thermal modes and changing the current mode):
•Get list of available thermal control zones: openbmctool.py -H [BMC] -U root -P 0penBmc thermal zones
•Get the current and supported thermal modes: openbmctool.py -H [BMC] -U root -P 0penBmc thermal modes get -z 0
•Set the thermal mode to "Custom": openbmctool.py -H [BMC] -U root -P 0penBmc thermal modes set -z 0 -m Custom
•Set the thermal mode to "Default": openbmctool.py -H [BMC] -U root -P 0penBmc thermal modes set -z 0 -m Default
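When the thermal mode commands above are issued from a script, building the argument list in one place helps avoid typing mistakes. This is a sketch under stated assumptions: the helper names are hypothetical, the host name in the usage note is a placeholder, and only the openbmctool options quoted above are used.

```python
def thermal_argv(bmc_host, user, password, *action):
    """Build the openbmctool.py argument list for a 'thermal' subcommand."""
    return ["openbmctool.py", "-H", bmc_host, "-U", user, "-P", password,
            "thermal", *action]


def set_mode_argv(bmc_host, user, password, mode, zone=0):
    """Build the argument list that sets a zone's thermal mode."""
    if mode not in ("Custom", "Default"):
        raise ValueError("mode must be 'Custom' or 'Default'")
    return thermal_argv(bmc_host, user, password,
                        "modes", "set", "-z", str(zone), "-m", mode)
```

The resulting list can be passed to subprocess.run(argv, check=True), for example set_mode_argv("bmc.example.com", "root", password, "Custom").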
System firmware changes that affect all systems
HIPER/Pervasive: A problem was fixed where, under certain conditions, a Power Management Reset (PM Reset) event may result in undetected data corruption. PM Resets occur under various scenarios such as power management mode changes between Dynamic Performance and Maximum Performance, power management controller recovery procedures, or system boot.
A problem was fixed for intermittent user and user-level privilege errors for the OS ipmipower command. The following error message is issued: "privilege level cannot be obtained for this user".
A problem was fixed for a re-IPL failure of OPAL with BB821410 logged. This is an intermittent and infrequent error that can occur if Skiboot fails to get notified of the BMC mailbox shutdown prior to the re-IPL attempt. The problem can be circumvented by doing another IPL.
A problem was fixed for IPMI power down and power on raw commands failing when issued in IPMI Restriction Mode. For this error, the host goes unresponsive with the following SEL list and message logged to the BMC GUI after the raw commands are issued from ipmitool:
> ipmitool -I lanplus -H 9.40.192.54 -P 0penBmc sel list
195 | 02/05/2019 | 09:48:19 | System Event #0x01 | Undetermined system hardware failure | Asserted
196 | 02/05/2019 | 10:36:54 | System Event #0x01 | Undetermined system hardware failure | Asserted
197 | 02/05/2019 | 10:40:53 | System Event #0x01 | Undetermined system hardware failure | Asserted
And the following error logged in the BMC GUI:
FQPSPCR0023M: Hostboot has become unresponsive _PID=2868 MESSAGE=org.open_power.Host.Boot.Error.WatchdogTimedOut
A problem was fixed for not being able to change the "IPMI admin" password away from the BMC default. With the fix, the "IPMI admin" password can be changed using the ipmitool command. Note: The "IPMI admin" password is independent of the "normal admin" password on the BMC, such as that used by the REST APIs. When REST is used to change the admin password, it is changing the "normal admin" user ID password, not the "IPMI admin" user ID password. Consider changing both of the "admin" user passwords to provide better security. The following is an example of using the ipmitool command to change the "IPMI admin" user ID password (the "1" represents userid 1, which corresponds to the "IPMI admin" user):
ipmitool user set password 1 your-new-IPMI-admin-password
Support was added to recognize a port parameter in the URL path for the Preboot eXecution Environment (PXE) in the ethernet adapters. Without the fix, there could be PXE discovery failures if a port was specified in the URL for the PXE.
A problem was fixed for a skiboot hang that could occur rarely for an I2C request if the I2C bus is in error or locked by the On-Chip Controller (OCC).
A problem was fixed for "Unexpected TCE size" error messages when Linux tried the default P9 PHB4 page size and used the unsupported 2M and 1G page sizes. The TCE page size property is now set correctly with 4K, 64K, 16M, and 256M supported.
A problem was fixed for PCIe ECC protection in the response data path for POWER9 processor parts. With the fix, PCIe ECC errors detected from the adjacent AIB (Adapter Interface Board) receive data path escalate to a checkstop so that the defective parts can be replaced.
A problem was fixed for an intermittent rare processor core lock failure that is not a real hardware problem. The erroneous failure looks like this in the logs:
LOCK ERROR: Releasing lock we don't hold depth @0x30493d20 (state: 0x0000000000000001)
[13836.000173140,0] Aborting!
CPU 0000 Backtrace:
S: 0000000031c03930 R: 000000003001d840 ._abort+0x60
S: 0000000031c039c0 R: 000000003001a0c4 .lock_error+0x64
S: 0000000031c03a50 R: 0000000030019c70 .unlock+0x54
S: 0000000031c03af0 R: 000000003001a040 .drop_my_locks+0xf4
A problem was fixed for the power-capping range allowed for the user. Changes were made to allow the user to access the entire powercap range, with two minimums exported into the OS: soft power cap minimum "powercap-min" and the hard power cap minimum limit "powercap-hard-min".
A problem was fixed for an OS reboot after a shutdown that intermittently fails after the shutdown. This can happen if the BMC is not ready to receive commands. With the fix, the messages to the BMC are validated and retried as needed. To recover from this error, the system can be rebooted from the BMC interface.
A problem was fixed for a kernel hard lock up that could occur if IPMI synchronous messages were sent from the OS to the BMC while the BMC was rebooting. For these types of messages, a processor thread remains waiting in OPAL until a response is returned from the BMC.
A problem was fixed for a rare Nest Memory Management Unit (NMMU) hang calling out processor hardware incorrectly, masking the real cause of the problem which was an NPU failure. The incorrect error messages take this form on the system:
3 | FQPSPPU0093G | 2018-10-01 01:25:40 | Yes | Warning | CPU 1 has exceeded a correctable error threshold
4 | FQPSPPU0093G | 2018-10-01 03:20:55 | Yes | Warning | CPU 0 has exceeded a correctable error threshold
5 | FQPSPAA0008M | 2018-10-01 04:35:40 | Yes | Critical | Hostboot procedure callout
A problem was fixed for certain errors from OCC and Power Management, such as B1xx2AD3, not having a SEL log created when one should have been.
A problem was fixed for DDR4 2933 MHz and 3200 MHz DIMMs not defaulting to the 2666 MHz speed on a new DIMM plug, thus preventing the system from IPLing.
Another fix was delivered for an intermittent IPL failure with BC131705 and BC8A1703 logged with a processor core called out. This is a rare error and does not have a real hardware fault, so the processor core can be unguarded and used again on the next IPL. This fix adds additional synchronization to the core threads over what was delivered in OP920.10.
A problem was fixed for memory bandwidth degradation that can occur if memory DIMMs have failed locations noted with symbol marks. Without the fix, the marked failed memory locations were wrongly subject to ECC corrector retries by the memory controller.
A problem was fixed for call home error logs from OCC having incorrect data regarding the memory bandwidth averages.
|
OP9_v2.0.10-2.2 / OP920.10
12/14/2018 | Impact: Data Severity: HIPER
New features and functions
The POWER9 default for the Spectre mitigation protection was changed to Kernel protection instead of Kernel + User protection. This was done to improve performance for user workloads such as Python scripts. The security vulnerabilities CVE-2017-5753 and CVE-2017-5715 (collectively known as Spectre) allow user-level code to infer data from unauthorized memory by using speculative execution to perform side-channel information disclosure attacks. If the default Spectre protection level had been changed on the system previously, this modified level of Spectre protection will persist across the firmware update.
The following are the steps that can be used to override the default Spectre protection to provide for more security by fully engaging the Spectre protection or provide for more performance by fully disengaging the Spectre protection. Note that disengaging the protections will leave the system vulnerable to attack via Spectre variant 2, and could result in data leakage and/or system compromise. The override is controlled by the BMC and requires a reboot of the POWER9 to take effect. To override the protection level:
1) Create/edit the "/var/lib/obmc/cfam_overrides" file on the BMC.
2) Add the following contents:
# Control speculative execution mode
0 0x283a 0x00000000 # bits 28:31 are used for init level - in this case 0 for Kernel + User
0 0x283F 0x20000000 # Indicate override register is valid
3) Re-IPL to apply changes.
Key:
init level 0 == Kernel + User protection (safest, old default)
init level 1 == Kernel protection only (safest, new default)
init level 2 == No protection
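The override file contents for each init level can be generated rather than typed by hand. The sketch below rests on an assumption that is not confirmed by this document: that bits 28:31 use IBM bit numbering (bit 0 is the most significant bit), so the init level occupies the lowest four bits of the 32-bit value. Only the level 0 value (0x00000000) is given in this document; confirm the values for levels 1 and 2 with your next level of support before use. The function name is illustrative.

```python
# Init levels from the key in this document.
INIT_LEVELS = {
    0: "Kernel + User protection",
    1: "Kernel protection only",
    2: "No protection",
}


def cfam_overrides(init_level):
    """Return the text for /var/lib/obmc/cfam_overrides for an init level.

    ASSUMPTION: the init level is written to the low four bits of the
    0x283a value (bits 28:31 in IBM bit numbering).
    """
    if init_level not in INIT_LEVELS:
        raise ValueError("init level must be 0, 1, or 2")
    return (
        "# Control speculative execution mode\n"
        "0 0x283a 0x%08x # bits 28:31 are used for init level\n"
        "0 0x283F 0x20000000 # Indicate override register is valid\n"
    ) % init_level
```

For init level 0 this reproduces the register lines shown in the steps above.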
Support was added for increasing the number of BMC error logs from 100 to 200 and changing the error log to roll over old entries when full instead of stopping the logging of errors. Without this feature, the error log would get full at 100 entries and error logging would be stopped until some of the error logs were purged to make room for new entries.
System firmware changes that affect all systems
HIPER/Non-pervasive: Fixes included to address potential scenarios that could result in undetected data corruption.
A problem was fixed for an intermittent opal-prd crash that can happen on the host OS. This is the fault signature: "opal-prd[2864]: unhandled signal 11 at 0000000000029320 nip 0000000102012830 lr 0000000102016890 code 1"
A problem was fixed for a PCI Host Bridge (PHB) configuration write error that caused the incorrect PCIe device to be frozen. The fault will be attributed to the last device to have a memory-mapped I/O operation (MMIO). With this fix, the freeze action for PHB configuration write errors is disabled in order to not impact functional hardware.
A problem was fixed for diagnostic code trying to read sensor values for PCI Host Bridge (PHB) entries that are unused, which causes debug output to have incorrect values for the unused entries. With the fix, only the used entries are processed by the diagnostic code.
A problem was fixed for an IPL loop/hang with a fatal MCE exception log caused by a probe of a failed PCI Host Bridge (PHB) that had been guarded. This is an infrequent error because it requires a PHB to have previously failed. The exception log has the following format:
Fatal MCE at 000000003006ecd4 .probe_phb4+0x570
CFAR : 00000000300b98a0
<snip>
Aborting!
CPU 0018 Backtrace:
S: 0000000031cc37e0 R: 000000003001a51c ._abort+0x4c
S: 0000000031cc3860 R: 0000000030028170 .exception_entry+0x180
S: 0000000031cc3a40 R: 0000000000001f10 *
S: 0000000031cc3c20 R: 000000003006ecb0 .probe_phb4+0x54c
S: 0000000031cc3e30 R: 0000000030014ca4 .main_cpu_entry+0x5b0
S: 0000000031cc3f00 R: 0000000030002700 boot_entry+0x1b8
A problem was fixed for an intermittent error message when activating firmware during a firmware update. This extraneous error message, an internal server error (500) returned on the REST enumerate request, occurred with moderate frequency. The error message can be ignored because there is not a problem with the firmware activation.
A problem was fixed for an On-Chip Controller (OCC) read failure with ERRNO=19 during a power off of the system. This intermittent problem is an extraneous error log and can be ignored because the power off is successful.
A problem was fixed for an intermittent power on failure with message "Error in mapper call to get service name". To recover from this problem, power cycle the BMC and try the boot again.
A problem was fixed for not being able to set the Power Supply Redundancy by using a REST API command. Without the fix, this was a read-only attribute.
A problem was fixed for memory Over-Temperature (OT) throttling not occurring when a DIMM reaches the throttle temperature. Although the frequency to the memory DIMMs is not reduced, the fan speeds do increase to provide more cooling for the DIMMs.
A problem was fixed for error logs occurring on the IPL following a DIMM error recovery. These logs, related to failed memory scrubbing, have the following "Signature Description": "mba(n0p15c1) () ERROR: command complete analysis failed". These error logs do not indicate a hardware problem and may be ignored.
A problem was fixed for system termination for a re-IPL with power on. The system can be recovered by powering off and then IPLing. This problem occurs infrequently and can be avoided by powering off the system between IPLs.
A problem was fixed for certain system boot failures not propagating to the BMC before the boot firmware shuts down. Some details of the error log may still appear in the console output trace, but the details will not be available with the BMC queries. This problem is timing dependent and intermittently possible depending on the timing of the shutdown path. However, immediate shutdowns exacerbate the problem and increase the chance it can occur.
|
OP9_v2.0.8-2.7 / OP920.03
09/25/2018
System firmware changes that affect all systems
A problem was fixed for an MSI-X checkstop in CAPI mode. This occurred intermittently when a DMA from the CAPI device targeted an address lower than 4GB and was confused for a 32-bit MSI operation. This is now avoided by disabling the 32-bit MSI when in CAPI mode.
A performance problem was fixed for certain cases of DMA operations from the GPU to an untranslated virtual memory location. With the fix, as much as a 10X performance improvement can occur for this type of DMA from the GPU.
|
OP9_v2.0.8-2.2 / OP920.02
08/14/2018 | Impact: Availability Severity: SPE
New features and functions
Support was added for parity error checking of the GPU data on the NVLink Datalink Layer (NDL), providing earlier memory fault detection and recovery retries to eliminate transient faults.
System firmware changes that affect all systems
A problem was fixed for PCIe4 CX5 adapter performance with an increase of performance of 40% for DMA read requests. The adapter affected is the Mellanox CX5 PCIe4 100Gb IB CAPI with feature codes #EC62 with CCIN 2CF1 and #EC64 with CCIN 2CF2. Without the fix, each read request requires a retry to work.
A problem was fixed for user code running on a GPU that can perform invalid commands to the MMIO space and cause an HMI that brings down the system. With the fix, ill-formatted commands to the MMIO space from the GPU will not be processed as a fatal exception but responses will be set to 0xFFFFFFFF and the GPU will receive a normal response code. The user GPU application can look for the bad response and fail, but the system will continue running without taking an HMI, allowing all other workloads to continue normally.
A problem was fixed for a GPU NVLINK writing out of range to an MMIO section of memory with byte-enabled writes that caused a machine check. With the fix, the out of range write is handled (detected) to cause a process core dump, but leaves the system in a usable state.
A problem was fixed for GPU workloads using unified memory with address translation service (ATS) sometimes hanging after resetting the GPUs. The trigger for the failure was putting the NPU in the fenced state via the "NPU Fence State" register with SCOM address 0x5011696. With the fix, the GPU fencing is handled using the NTL (NVLink Transaction Layer) reset register bits instead.
A problem was fixed for user applications timing out on the GPU operation for accessing the address translation service (ATS) unified memory, causing an HMI and system termination. With the fix, the ATSD timeout has been disabled, so the user applications can wait for GPU read or write operations to be completed without regard for the time needed for the operation.
A problem was fixed for the On-Chip Controller inadvertently disabling the MMIO ATSD flush bits, thereby potentially reducing the performance of the address translation service (ATS) unified memory for the GPU.
A problem was fixed for the SBE timer being stuck and unavailable to the host applications. This forces OPAL to use legacy timer loops for timers at the cost of additional processor bandwidth. Here are the messages that are logged for the problem that occurs on every boot:
[ 194.494559313,3] SBE: Timer stuck, falling back to OPAL pollers.
[ 194.494624185,3] SBE: You will likely have slower I2C and may have experienced increased jitter.
A problem was fixed for NPU log messages that were missing the CPU chip identifiers. With the fix, the CPU taking the HMI (Hypervisor Maintenance Interrupt) is listed along with the NPU FIR register values.
A problem was fixed for various I2C devices such as DIMMs, SEEPROMs, and TPMs failing during the IPL with an I2C arbitration loss condition. This results in unusable hardware or IPL failures. Frequency of this intermittent error varies with the number of I2C devices in the system. A re-IPL of the system can recover from the problem.
|
OP9_v2.0.3-2.17 / OP920.01
06/22/2018 | Impact: Data Severity: HIPER
New features and functions
Support for the Simple Network Management Protocol (SNMP) was added to the BMC. SNMP is an Internet standard protocol for collecting and organizing information about managed devices on IP networks and for modifying that information to change device behavior.
Support for network configuration was added to the BMC web GUI.
System firmware changes that affect all systems
HIPER/Pervasive: A firmware change was made to address a rare case where a memory correctable error on POWER9 servers may result in an undetected corruption of data.
A problem has been fixed for a PCIe adapter running in CAPP mode having a missing MMIO Base Address Register (BAR) entry that causes a failure of the adapter and a fence off of two of the four ports of the adapter.
A problem has been fixed for a slow start up of a process that can occur when the system had been previously in an idle state.
A problem was fixed for a failure in DDR4 RCD (Register Clock Driver) memory initialization that causes half of the DIMM memory to be unusable after an IPL. This is an intermittent problem where the memory can sometimes be recovered by doing another IPL. The error is not a hardware problem with the DIMM but an error in the initialization sequence needed to get the DIMM ready for normal operations.
SW429936(HOSTBOOT): A problem has been fixed for systems unexpectedly running with all processors at lower frequencies than would be expected for Workload Optimized Frequency (WOF) ultra-turbo mode.
SW429049(HCODE): A problem was fixed for a processor core that cannot be awakened or a timeout in the On-Chip Controller (OCC) when switching Workload Optimized Frequency (WOF) modes from disabled to enabled. These errors can cause a reduction in performance by running with fewer cores or by running at the safe mode frequencies.
|
OP9_v2.0_2.14 / OP920.00
05/25/2018 | Impact: New Severity: New
New features and functions for MTM 8335-GTH and 8335-GTX:
GA Level
|
OS levels supported by the LC 8335 servers:
- Minimum level is Red Hat Enterprise Linux 7.5 for IBM Power LE (POWER9), also known as RHEL 7.5-ALT LE, with third Z-stream or later (https://access.redhat.com/errata/RHBA-2018:2467 has the needed kernel "kernel-alt-4.14.0-49.10.1.el7a.src.rpm"). The recommended level is RHEL 7.5-ALT LE, with fourth Z-stream or later (https://access.redhat.com/errata/RHSA-2018:2772 has the needed kernel "kernel-alt-4.14.0-49.13.1.el7a.src.rpm" ).
This RHEL level has fixes for ATS (Address Translation Service) for improved performance for the GPU access of memory.
- NVIDIA Tesla CUDA recommended driver level 396.44 or later, or minimum driver level 396.26 from the CUDA 9.2 toolkit
Additional OS level supported by the AC922(8335-GTH) server:
- Ubuntu 18.04, or later (with no GPU option)
IBM Power LC 8335 servers support Linux, which provides a UNIX-like implementation across many computer architectures. Linux supports almost all of the Power System I/O and the configurator verifies support on order. For more information about the software that is available on IBM Power Systems, see the Linux on IBM Power Systems website:
http://www.ibm.com/systems/power/software/linux/index.html
The Linux operating system is an open source, cross-platform OS. It is supported on every Power Systems server IBM sells. Linux on Power Systems is the only Linux infrastructure that offers both scale-out and scale-up choices.
A supported version of Linux on the Power LC 8335 is Red Hat Enterprise Linux 7.5 for IBM Power LE (POWER9) (RHEL 7.5-ALT LE).
For additional questions about the availability of this release and supported Power servers, consult the Red Hat Hardware Catalog at
https://access.redhat.com/products/red-hat-enterprise-linux/#addl-arch.
For the AC922 (8335-GTH) that is configured without GPUs, there is the option of using Linux Ubuntu 18.04 or later as the OS.
For more information about Linux on Power, see the Linux on Power developer center at https://developer.ibm.com/linuxonpower/
For information about the features and external devices that are supported by Linux, see this website:
http://www.ibm.com/systems/power/software/linux/index.html
Use one of the following commands at the Linux command prompt to determine the current Linux level:
•cat /proc/version
•uname -a
The output string from the command will provide the Linux version level.
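On RHEL 7.5-ALT, the kernel release reported by "uname -r" can be compared against the kernel levels named in the OS level list earlier in this document (4.14.0-49.10.1.el7a minimum, 4.14.0-49.13.1.el7a recommended). This is a minimal sketch that assumes a release string of the form "4.14.0-49.13.1.el7a.ppc64le"; the function names are illustrative, and the comparison is only meaningful for RHEL-ALT kernels.

```python
import re

# Kernel levels named in this document for RHEL 7.5-ALT LE.
MINIMUM_KERNEL = (4, 14, 0, 49, 10, 1)
RECOMMENDED_KERNEL = (4, 14, 0, 49, 13, 1)


def kernel_level(release):
    """Return the leading numeric fields of a kernel release string."""
    match = re.match(r"\d+(?:[.-]\d+)*", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(part) for part in re.split(r"[.-]", match.group(0)))


def classify_kernel(release):
    """Classify a RHEL-ALT kernel release against the documented levels."""
    level = kernel_level(release)
    if level < MINIMUM_KERNEL:
        return "below minimum"
    if level < RECOMMENDED_KERNEL:
        return "minimum"
    return "recommended"
```

On the host, platform.release() from the Python standard library returns the same string as "uname -r".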
The opal-prd package on the Linux system collects the OPAL Processor Recovery Diagnostics messages to log file /var/log/syslog. It is recommended that this package be installed if it is not already present as it will help with maintaining the system processors by alerting the users to processor maintenance when needed.
On Red Hat Linux, run the command "rpm -qa | grep -i opal-prd". The command output indicates the package is installed on your system if the rpm for opal-prd is found and displayed. This package provides a daemon to load and run the OpenPOWER firmware's Processor Recovery Diagnostics binary, which is responsible for run-time maintenance of Power hardware. If the package is not installed on your system, the following command can be run on Red Hat to install it:
sudo yum install opal-prd
To display the PNOR level, use the following BMC command: "cat /var/lib/phosphor-software-manager/pnor/ro/VERSION | grep -A 12 IBM ".
The grep is needed to remove a security string at the start of the VERSION output for easier viewing of the PNOR level.
And the BMC command line command "cat" can be used to display the BMC level: "cat /etc/os-release".
Note: the "cat" commands are run after ssh to the BMC as root and the default password is 0penBmc (where 0 is the zero character).
Follow the instructions on Fix Central. You must read and agree to the license agreement to obtain the firmware packages.
The updating and upgrading of system firmware depends on several factors, such as the current firmware that is installed and which operating system is running on the system.
These scenarios and the associated installation instructions are comprehensively outlined in the firmware section of Fix Central, found at the following website:
http://www.ibm.com/support/fixcentral/
Any hardware failures should be resolved before proceeding with the firmware updates to help ensure the system will not be running degraded after the updates.
The process of updating firmware on the OpenBMC managed servers is documented below.
The sequence of events that must happen is the following:
•Power off the Host
•Update and Activate BMC
•Update and Activate PNOR
•Reboot the BMC (applies new BMC image)
•Power on the Host (applies new PNOR image)
The OpenBMC firmware updates (BMC and PNOR) for the LC 8335 servers can be managed via the command line with the openbmctool.
The openbmctool is obtained using the IBM Support Portal.
1.Go to the IBM Support Portal.
2.In the search field, enter your machine type and model. Then click the correct product support entry for your system.
3.From the Downloads list, click the openbmctool for your machine type and model.
4.Follow the instructions to install and run the openbmctool. You will need to provide the file locations of the BMC firmware image tar and PNOR firmware image tar that must be downloaded from Fix Central for the update level needed.
Information on the openbmctool and the firmware update process can be found in the IBM Knowledge Center:
https://www.ibm.com/support/knowledgecenter/POWER9/p9ei8/p9ei8_update_firmware_openbmctool.htm .
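The five-step update sequence listed earlier in this section can be laid out as a dry run before anything is executed. The following sketch only prints commands; it does not run them. The openbmctool subcommands and options shown (chassis power, firmware flash, bmc reset, and the -H/-U/-P connection options) are assumptions based on common openbmctool usage and should be confirmed with "openbmctool -h" and the Knowledge Center instructions above. The host name, password, and function name are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch: print the firmware update steps in the required order (dry run).
# Nothing here touches the BMC; the commands are only echoed.
# ASSUMPTION: the openbmctool subcommands below match your tool version.

print_update_plan() {
    bmc_tar=$1     # BMC firmware image tar downloaded from Fix Central
    pnor_tar=$2    # PNOR firmware image tar downloaded from Fix Central
    tool="openbmctool -H bmc_ip_address -U root -P bmc_password"

    echo "$tool chassis power softoff"            # 1. Power off the host
    echo "$tool firmware flash bmc -f $bmc_tar"   # 2. Update and activate BMC
    echo "$tool firmware flash pnor -f $pnor_tar" # 3. Update and activate PNOR
    echo "$tool bmc reset cold"                   # 4. Reboot the BMC
    echo "$tool chassis power on"                 # 5. Power on the host
}

# Usage:
# print_update_plan /path/to/bmc_image.tar /path/to/pnor_image.tar
```

Printing the plan first makes it easy to review the order and the image paths before running each command by hand.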
The service processor, or baseboard management controller (BMC), provides a hypervisor and operating system-independent layer that uses the robust error detection and self-healing functions that are built into the POWER processor and memory buffer modules. The OpenPOWER Abstraction Layer (OPAL) is the system firmware in the stack of POWER processor-based Linux-only servers.
The service processor, or baseboard management controller (BMC), is the primary control for autonomous sensor monitoring and event logging features on the LC server.
The BMC supports the Intelligent Platform Management Interface (IPMI) for system monitoring and management. The BMC monitors the operation of the firmware during the boot process and also monitors the OPAL hypervisor for termination.
The OpenPOWER Abstraction Layer (OPAL) provides hardware abstraction and run time services to the running host Operating System.
For the 8335 servers, only the OPAL bare-metal installs can be used.
Find out more about OPAL skiboot here:
https://github.com/open-power/skiboot
The Intelligent Platform Management Interface (IPMI) is an open standard for monitoring, logging, recovery, inventory, and control of hardware that is implemented independent of the main CPU, BIOS, and OS. The LC 8335 servers provide one 10M/100M baseT IPMI port.
The ipmitool is a utility for managing and configuring devices that support IPMI. It provides a simple command-line interface to the service processor. You can install the ipmitool from the Linux distribution packages in your workstation, sourceforge.net, or another server (preferably on the same network as the installed server).
For installing ipmitool from sourceforge, please see section 1.1 "Minimum ipmitool Code Level".
Several good references are available for ipmitool commands:
•The man page
•The built-in command line help, which provides a list of ipmitool commands:
# ipmitool help
You can also get help for many specific IPMItool commands by adding the word help after the command:
# ipmitool channel help
For a list of common ipmitool commands and help on each, you may use the following link:
www.ibm.com/support/knowledgecenter/linuxonibm/liabp/liabpcommonipmi.htm
To connect to your host system with IPMI, you need to know the IP address of the server and have a valid password. To power on the server with the ipmitool, follow these steps:
1. Open a terminal program.
2. Power on your server with the ipmitool:
ipmitool -I lanplus -H bmc_ip_address -P ipmi_password power on
3. Activate your IPMI console:
ipmitool -I lanplus -H bmc_ip_address -P ipmi_password sol activate
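The two commands above can be wrapped so that the power-on is skipped when the host is already running. The following is a minimal sketch; BMC_HOST and IPMI_PASSWORD are hypothetical placeholders, and the function names are illustrative only.

```shell
#!/bin/sh
# Sketch: power on the host if needed, then attach the IPMI console.
# BMC_HOST and IPMI_PASSWORD are placeholders; set them for your server.
BMC_HOST="bmc_ip_address"
IPMI_PASSWORD="ipmi_password"

ipmi() {
    # Same connection options as the documented commands.
    ipmitool -I lanplus -H "$BMC_HOST" -P "$IPMI_PASSWORD" "$@"
}

host_is_on() {
    # "power status" reports "Chassis Power is on" or "Chassis Power is off".
    ipmi power status | grep -q "is on"
}

power_on_and_attach() {
    if ! host_is_on; then
        ipmi power on
    fi
    ipmi sol activate
}

# Usage:
# power_on_and_attach
```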
Petitboot is a kexec-based bootloader used by IBM POWER9 systems for bare-metal installs on the 8335 servers.
After the POWER9 system powers on, the petitboot bootloader scans local boot devices and network interfaces to find boot options that are available to the system. Petitboot returns a list of boot options that are available to the system. If you are using a static IP or if you did not provide boot arguments in your network boot server, you must provide the details to petitboot. You can configure petitboot to find your boot with the following instructions:
https://www.ibm.com/support/knowledgecenter/linuxonibm/liabp/liabppetitbootadvanced.htm
You can edit petitboot configuration options, change the amount of time before Petitboot automatically boots, etc. with these instructions:
https://www.ibm.com/support/knowledgecenter/linuxonibm/liabp/liabppetitbootconfig.htm
After you select to boot the ISO media for the Linux distribution of your choice, the installer wizard for that Linux distribution walks you through the steps to set up disk options, your root password, time zones, and so on.
You can read more about the petitboot bootloader program here:
https://www.kernel.org/pub/linux/kernel/people/geoff/petitboot/petitboot.html
This guide helps you install Linux on a Power Systems server.
Overview
Use the information found in http://www.ibm.com/support/knowledgecenter/linuxonibm/liabw/liabwkickoff.htm to install Linux on a non-virtualized (bare metal) IBM Power LC server.
Date | Description |
08/02/2019 | Updated instructions for retrieving PNOR version |
04/04/2019 | Added OP920.20 |
12/14/2018 | Added OP920.10 |
10/02/2018 | Republished PNOR binaries |
09/25/2018 | Added OP920.03 |
08/14/2018 | Added OP920.02 |
06/22/2018 | Added OP920.01 |
06/01/2018 | OP920.00 delivery - Added files to support an upgrade from 8335-GTG to 8335-GTH |
05/25/2018 | New for AC922 LC servers for the OP920.00 release |