Readme file for Spark 3.3.1 package for IBM® Spectrum Conductor 2.5.1

Readme file for: IBM Spectrum Conductor
Product/Component release: 2.5.1
Fix ID: sc-2.5.1-build601552

Publication date: June 28, 2023


This enhancement provides the Apache Spark 3.3.1 package, integrated for use with an IBM Spectrum Conductor 2.5.1 cluster.

 

Notes:

·       When used with Jupyter notebooks, this Spark 3.3.1 package does not support the R and Scala Spark cluster mode kernels. Support will become available when the Jupyter Enterprise Gateway and Apache Toree projects release versions that support Spark 3.3 with R and Scala.

·       Spark 3.3.1 upgrades logging from log4j 1.x to log4j2 version 2.17.2 (see https://issues.apache.org/jira/browse/SPARK-38544), so follow the log4j2.properties file format when modifying logging properties for a Spark 3.3.1 application; a sample configuration is shown after this list.

·       Apache Spark 3.x introduces custom resource scheduling with the ResourceProfile concept and related APIs (see https://spark.apache.org/docs/3.3.1/configuration.html#custom-resource-scheduling-and-configuration-overview), which is not directly supported by this Spark 3.3.1 and IBM Spectrum Conductor integration. Spark 3.3.1 remains fully integrated with IBM Spectrum Conductor’s GPU resource management feature (as described in https://www.ibm.com/docs/en/spectrum-conductor/2.5.1?topic=workload-gpus).

·       Apache Spark 3.3.1 introduces a feature called “stage level scheduling” (see https://spark.apache.org/docs/3.3.1/configuration.html#stage-level-scheduling-overview), which is not directly supported by this Spark 3.3.1 and IBM Spectrum Conductor integration. For GPU-based resources, similar stage level scheduling functionality can be implemented with IBM Spectrum Conductor’s GPU RDD API instead of the RDD with Resources API (as described in https://www.ibm.com/docs/en/spectrum-conductor/2.5.1?topic=samples-gpu-rdd).

·       This Spark 3.3.1 package supports Spark applications with Python 3.10, so Python 3.10 can be used for batch Spark workloads either when it is specified directly in the PySpark application or, by default, when it is defined in the conda environment (set on the Basic Settings tab in the instance group’s configuration). IBM Spectrum Conductor’s Jupyter 6.x notebook servers currently cannot use Python 3.10 in Jupyter’s conda environment because of incompatible dependencies for third-party packages. Therefore, to use Python 3.10 with Spark 3.3.1 in notebooks with Python Spark cluster mode kernels, set the full path of the conda environment’s Python 3.10 binary in the NOTEBOOK_SPARK_PYSPARK_PYTHON environment variable on the Basic Settings tab (within the Configure pop-up for the Jupyter 6.0.0 component); an example value is shown after this list. For more details, see https://www.ibm.com/docs/en/spectrum-conductor/2.5.1?topic=variables-environment-jupyter-notebooks.

·       This Spark 3.3.1 integration package runs on Java 8/11/17, Scala 2.12, Python 3.7+ (Python 3.10 is the latest tested version), and R 3.5+. Support for Java 8 versions prior to 8u201 is deprecated. When using the Scala API, applications must use the same Scala 2.12 version that Spark was compiled for.
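
For example, a minimal log4j2.properties snippet in the format expected by Spark 3.3.1 might look as follows (the org.apache.spark logger and the levels shown are only an illustration; adjust names and levels to your needs):

rootLogger.level = info
rootLogger.appenderRef.stdout.ref = console

appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

logger.spark.name = org.apache.spark
logger.spark.level = warn

In the old log4j 1.x format, the equivalent settings were written as log4j.rootCategory=INFO, console and log4j.logger.org.apache.spark=WARN; those 1.x-style lines do not apply to the log4j2 format.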
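
For example, if the Python 3.10 conda environment is named python310 and Anaconda is installed under /opt/anaconda3 (both the name and the path are illustrative; use the actual locations in your cluster), the environment variable value would be similar to:

NOTEBOOK_SPARK_PYSPARK_PYTHON=/opt/anaconda3/envs/python310/bin/python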

 

 

Contents
1.     Download location 

2.     Products or components affected

3.     Installation and configuration

4.     List of files

5.     Product notifications 

6.     Copyright and trademark information

1.   Download location

Download the Spark 3.3.1 package from the following location: http://www.ibm.com/eserver/support/fixes/.

2.   Products or components affected

Component name, Platform, Fix ID:                                                    

Spark 3.3.1, linux-x86_64 or linux-ppc64le, sc-2.5.1-build601552

 

3.   Installation and configuration

Prerequisites

Fix 601350 must be installed before you install this package.

 

1.    Log on to the primary host in the cluster as the cluster administrator:

egosh user logon -u Admin -x Admin

 

2.     Stop the Elastic Stack related services:

a.       Run egosh service stop elk-shipper.
Verify that the elk-shipper service is in DEFINED state:  
egosh service list -ll | grep elk-shipper | grep DEFINED

b.       Run egosh service stop elk-indexer.
Verify that the elk-indexer service is in DEFINED state.

c.       Run egosh service stop elk-elasticsearch-master elk-elasticsearch-data elk-elasticsearch.
Verify that all these elk-elasticsearch services are in DEFINED state.

d.       Run egosh service stop elk-manager.
Verify that the elk-manager service is in DEFINED state.
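
For example, to confirm in a single check that all of the Elastic Stack services have stopped (a sketch using the same egosh service list command as in step 2a), verify that every elk-* service reports DEFINED:

egosh service list -ll | grep elk-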

3.     On your primary host, download and extract the following packages to, for example, an /scfixes directory:

   Linux-x86_64:

conductorsparkcore-2.5.1.0_x86_64_build601552.tar.gz

Linux-ppc64le:

conductorsparkcore-2.5.1.0_ppc64le_build601552.tar.gz

4.   On your primary host, run the egoinstallfixes command to install the downloaded packages. For example:

 

Linux-x86_64:

egoinstallfixes /scfixes/conductorsparkcore-2.5.1.0_x86_64_build601552.tar.gz

Linux-ppc64le:

egoinstallfixes /scfixes/conductorsparkcore-2.5.1.0_ppc64le_build601552.tar.gz

Important: Running the egoinstallfixes command automatically backs up the current binary files to a fix backup directory. To keep the original files available for recovery, do not delete this backup directory. For more information on using this command, see the egoinstallfixes command reference.

5.   Run the pversions command to verify the installation:

   pversions -b 601552

6.   Back up the old cws_spark.conf file, remove it, and then copy in the new version. For example:

tar -cvf backup_old_601552.tar $EGO_CONFDIR/../../integration/elk/conf/indexer/cws_spark.conf

rm -rf $EGO_CONFDIR/../../integration/elk/conf/indexer/cws_spark.conf

cp $EGO_TOP/integration/elk/activation/conductorsparkcore2.5.1/conf/indexer/cws_spark.conf $EGO_CONFDIR/../../integration/elk/conf/indexer/cws_spark.conf

7.   From the primary host, restart the previously stopped services.
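
For example, one possible sequence (a sketch that reverses the stop order used in step 2; use egosh service list to confirm that each service has started):

egosh service start elk-manager

egosh service start elk-elasticsearch-master elk-elasticsearch-data elk-elasticsearch

egosh service start elk-indexer

egosh service start elk-shipper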

8.     Download the sc-2.5.1.0_build601552.tgz package to a local directory. Decompress the file. Once decompressed, you will have the following Spark package:

Spark3.3.1-Conductor2.5.1.tgz
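
For example, to decompress the package (assuming it was downloaded to /scfixes, an illustrative path):

tar -zxvf /scfixes/sc-2.5.1.0_build601552.tgz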

9.     Launch a browser and log in to the cluster management console as a cluster administrator.

10.  If a Spark 3.3.1 package already exists in the cluster, remove it:

a.     Click Resources > Frameworks > Spark Management.

b.     Select version 3.3.1.

c.     Click Remove.

d.     In the confirmation dialog, click Remove.

11.  Add the new Spark 3.3.1 package to the cluster:

a.     Click Resources > Frameworks > Spark Management.

b.     Click Add.

c.     Click Browse and select the Spark3.3.1-Conductor2.5.1.tgz package downloaded previously.

d.     Click Add.

12. Enable notebook support for Spark 3.3.1:

a.   Navigate to Resources > Frameworks > Notebook Management.

b.   Select the Jupyter 6.0.0 notebook and click Configure.

c.   Click the Environment Variables tab.

d.   Edit the value for supported_spark_versions to include 3.3.1.

e.   Click Update Notebook.

13. If the instance group configuration uses a non-default OpenJDK Java 11 binary path for the JAVA_HOME parameter, modify the Spark configuration to work around a compatibility issue:

a.     Click Workload > Instance Groups and select the instance group that uses the Spark 3.3.1 package.

b.     In the instance group overview, click Manage > Configure.

c.     Click the Spark tab, then the Configuration button, and search for SPARK_MASTER_OPTS.

d.     For the SPARK_MASTER_OPTS parameter, enter this value:

'--add-opens java.base/jdk.internal.ref=ALL-UNNAMED'

e.     Click Close.

f.       Click the Modify Instance Group button, start the instance group, and then run your application.

Note: If the installation is in a shared location, add -Xshare:off to the SPARK_MASTER_OPTS value, within the single quotation marks, to disable the class data sharing feature. For example:

'--add-opens java.base/jdk.internal.ref=ALL-UNNAMED -Xshare:off'

14.  If the instance group configuration uses a non-default OpenJDK Java 17 binary path for the JAVA_HOME parameter, modify the Spark configuration to work around a compatibility issue:

a.     Click Workload > Instance Groups and select the instance group that uses the Spark 3.3.1 package.

b.     In the instance group overview, click Manage > Configure.

c.     Click the Spark tab, then the Configuration button, and search for spark.driver.extraJavaOptions and spark.executor.extraJavaOptions.

d.     For both the spark.driver.extraJavaOptions and spark.executor.extraJavaOptions parameters, enter the following value:

'--add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED'

e.     Click Close.

f.       Click the Modify Instance Group button, start the instance group, and then run your application.

15.  If required, upgrade your existing Spark instance groups to use the new Spark 3.3.1 package. For details, see https://www.ibm.com/docs/en/spectrum-conductor/2.5.1?topic=groups-using-updated-upgraded-components

 

4.   List of files 

sc-2.5.1.0_build601552.tgz

Spark3.3.1-Conductor2.5.1.tgz

conductorsparkcore-2.5.1.0_x86_64_build601552.tar.gz

integration/elk/activation/conductorsparkcore2.5.1/conf/indexer/cws_spark.conf

conductorsparkcore-2.5.1.0_ppc64le_build601552.tar.gz

integration/elk/activation/conductorsparkcore2.5.1/conf/indexer/cws_spark.conf

5.   Product notifications

To receive information about product solutions and patch updates automatically, subscribe to product notifications on the My Notifications page (http://www.ibm.com/support/mynotifications/) on the IBM Support website (http://support.ibm.com). You can edit your subscription settings to choose the types of information that you want to receive notifications about; for example, security bulletins, fixes, troubleshooting, and product enhancements or documentation changes.

6.   Copyright and trademark information 

© Copyright IBM Corporation 2023

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM®, the IBM logo and ibm.com® are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml