Readme file for Spark 3.3.1 package for IBM® Spectrum Conductor 2.5.1
Readme file for: IBM Spectrum Conductor
Product/Component release: 2.5.1
Fix ID: sc-2.5.1-build601552
Publication date: June 28, 2023
This enhancement provides the Apache Spark version 3.3.1 package, integrated for use with an IBM Spectrum Conductor 2.5.1 cluster.
Notes:
· This Spark 3.3.1 package used with Jupyter notebooks does not support the R and Scala Spark cluster mode kernels. Support for this will be available when the Jupyter Enterprise Gateway and Apache Toree projects have releases supporting Spark 3.3 with R and Scala.
· This Spark 3.3.1 version upgrades logging from log4j version 1.x to log4j2 version 2.17.2 (see https://issues.apache.org/jira/browse/SPARK-38544), so follow the log4j2.properties file format when modifying logging properties for a Spark 3.3.1 application (a minimal example follows these notes).
· Apache Spark 3.x introduces custom resource scheduling with the ResourceProfile concept and related APIs (see https://spark.apache.org/docs/3.3.1/configuration.html#custom-resource-scheduling-and-configuration-overview), which are not directly supported by this Spark 3.3.1 and IBM Spectrum Conductor integration. Spark 3.3.1 is still fully integrated with IBM Spectrum Conductor’s GPU resource management feature (as described in https://www.ibm.com/docs/en/spectrum-conductor/2.5.1?topic=workload-gpus).
· Apache Spark 3.3.1 introduces a new feature called “stage level scheduling” (see https://spark.apache.org/docs/3.3.1/configuration.html#stage-level-scheduling-overview), which is not directly supported by this Spark 3.3.1 and IBM Spectrum Conductor integration. For GPU-based resources, instead of using the RDD with Resources API, similar stage level scheduling functionality on GPUs can be implemented using IBM Spectrum Conductor’s GPU RDD API (as described in https://www.ibm.com/docs/en/spectrum-conductor/2.5.1?topic=samples-gpu-rdd).
· This Spark 3.3.1 package supports Spark applications with Python 3.10, so Python 3.10 can be used for batch Spark workloads either when it is specified directly in the PySpark application or, by default, when it is defined in the conda environment (set in the Basic Settings tab in the instance group’s configuration). Because IBM Spectrum Conductor’s Jupyter 6.x notebook servers currently cannot use Python 3.10 in Jupyter’s conda environment (due to incompatible dependencies for third-party packages), when using Python 3.10 with Spark 3.3.1 in notebooks with Python Spark cluster mode kernels, set the full path of the Python 3.10 binary from the conda environment (chosen on the Basic Settings tab) in the NOTEBOOK_SPARK_PYSPARK_PYTHON environment variable (within the Configure pop-up for the Jupyter 6.0.0 component); an example value follows these notes. For more details, see https://www.ibm.com/docs/en/spectrum-conductor/2.5.1?topic=variables-environment-jupyter-notebooks.
· This Spark 3.3.1 integration package runs on Java 8/11/17, Scala 2.12, Python 3.7+ (Python 3.10 is the latest tested version), and R 3.5+. Support for Java 8 versions prior to version 8u201 is deprecated. When using the Scala API, applications must use the same Scala 2.12 version for which Spark was compiled.
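The following is a minimal sketch of log4j2.properties-format logging settings for a Spark 3.3.1 application. The property names mirror Spark’s bundled conf/log4j2.properties.template; the org.apache.spark.shuffle logger shown at the end is only an illustrative choice:
rootLogger.level = info
rootLogger.appenderRef.stdout.ref = console
appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Adjust individual loggers with name/level pairs, for example:
logger.shuffle.name = org.apache.spark.shuffle
logger.shuffle.level = warn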
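For the NOTEBOOK_SPARK_PYSPARK_PYTHON environment variable, set the full path of the Python 3.10 binary inside the conda environment used by the instance group. The path below is only a hypothetical example; the actual location depends on where your conda environment is created:
NOTEBOOK_SPARK_PYSPARK_PYTHON=/opt/anaconda3/envs/python310/bin/python3.10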
Contents:
2. Products or components affected
3. Installation and configuration
4. List of files
5. Product notifications
6. Copyright and trademark information
2. Products or components affected

Download the Spark 3.3.1 package from the following location: http://www.ibm.com/eserver/support/fixes/.
Component name, Platform, Fix ID:
Spark 3.3.1, linux-x86_64 or linux-ppc64le, sc-2.5.1-build601552
3. Installation and configuration

Prerequisites:
Fix 601350 must be installed before you install this patch package.
1. Log on to the primary host in the cluster as the cluster administrator:
egosh user logon -u Admin -x Admin
2. Stop the Elastic Stack related services:
a. Run egosh service stop elk-shipper.
Verify that the elk-shipper service is in DEFINED state:
egosh service list -ll | grep elk-shipper | grep DEFINED
b. Run egosh service stop elk-indexer.
Verify that the elk-indexer service is in DEFINED state.
c. Run egosh service stop elk-elasticsearch-master elk-elasticsearch-data elk-elasticsearch.
Verify that all these elk-elasticsearch services are in DEFINED state.
d. Run egosh service stop elk-manager.
Verify that the elk-manager service is in DEFINED state.
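For example, the complete stop-and-verify sequence looks similar to the following (repeat each list command until the service is shown in DEFINED state):
egosh service stop elk-shipper
egosh service list -ll | grep elk-shipper | grep DEFINED
egosh service stop elk-indexer
egosh service list -ll | grep elk-indexer | grep DEFINED
egosh service stop elk-elasticsearch-master elk-elasticsearch-data elk-elasticsearch
egosh service list -ll | grep elk-elasticsearch | grep DEFINED
egosh service stop elk-manager
egosh service list -ll | grep elk-manager | grep DEFINED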
3. On your primary host, download and extract the following packages to, for example, an /scfixes directory:
Linux-x86_64: conductorsparkcore-2.5.1.0_x86_64_build601552.tar.gz
Linux-ppc64le: conductorsparkcore-2.5.1.0_ppc64le_build601552.tar.gz
4. On your primary host, run the egoinstallfixes command to install the downloaded packages. For example:
Linux-x86_64:
egoinstallfixes /scfixes/conductorsparkcore-2.5.1.0_x86_64_build601552.tar.gz
Linux-ppc64le:
egoinstallfixes /scfixes/conductorsparkcore-2.5.1.0_ppc64le_build601552.tar.gz
Important: Running the egoinstallfixes command automatically backs up the current binary files to a fix backup directory. Do not delete this backup directory; it is required to recover the original files. For more information on using this command, see the egoinstallfixes command reference.
5. Run the pversions command to verify the installation:
pversions -b 601552
6. Back up the existing cws_spark.conf file, remove the old file, and then copy in the new one. For example:
tar -cvf backup_old_601552.tar $EGO_CONFDIR/../../integration/elk/conf/indexer/cws_spark.conf
rm -rf $EGO_CONFDIR/../../integration/elk/conf/indexer/cws_spark.conf
cp $EGO_TOP/integration/elk/activation/conductorsparkcore2.5.1/conf/indexer/cws_spark.conf $EGO_CONFDIR/../../integration/elk/conf/indexer/cws_spark.conf
7. From the primary host, restart the previously stopped services.
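For example, assuming the egosh service start command and the same service names as in step 2 (one possible ordering is the reverse of the stop order):
egosh service start elk-manager
egosh service start elk-elasticsearch-master elk-elasticsearch-data elk-elasticsearch
egosh service start elk-indexer
egosh service start elk-shipper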
8. Download the sc-2.5.1.0_build601552.tgz package to a local directory and decompress the file. Once decompressed, you will have the following Spark package:
Spark3.3.1-Conductor2.5.1.tgz
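For example, a typical way to decompress the downloaded package (run from the local directory you downloaded it to):
tar -zxvf sc-2.5.1.0_build601552.tgz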
9. Launch a browser and log in to the cluster management
console as a cluster administrator.
10. If a Spark 3.3.1 package already exists in the cluster, remove it:
a. Click Resources > Frameworks > Spark
Management.
b. Select version 3.3.1.
c. Click Remove.
d. In the confirmation dialog, click Remove.
11. Add the new Spark 3.3.1 package to the cluster:
a. Click Resources > Frameworks > Spark Management.
b. Click Add.
c. Click Browse and select the Spark3.3.1-Conductor2.5.1.tgz package downloaded previously.
d. Click Add.
12. Enable notebooks to support Spark 3.3.1:
a. Navigate to Resources > Frameworks
> Notebook Management.
b. Select the Jupyter
6.0.0 notebook and click Configure.
c. Click the Environment Variables tab.
d. Edit the value for supported_spark_versions to
include 3.3.1.
e. Click Update Notebook.
13. If the instance group configuration is set to a non-default OpenJDK Java 11 version binary path for the JAVA_HOME parameter, modify the Spark configuration to work around a compatibility issue:
a. Click Workload > Instance Groups and select the instance group that uses the new Spark 3.3.1 package.
b. In the instance group overview, click Manage > Configure.
c. Click the Spark tab, then the Configuration button, and search for SPARK_MASTER_OPTS.
d. For the SPARK_MASTER_OPTS parameter, enter this value:
‘--add-opens java.base/jdk.internal.ref=ALL-UNNAMED’
e. Click Close.
f. Click the Modify Instance Group button, start the instance group, and then run your application.
Note: If the installation is to a shared location, add -Xshare:off to the SPARK_MASTER_OPTS parameter value (within the single quotation marks) to disable the class data sharing feature. For example:
‘--add-opens java.base/jdk.internal.ref=ALL-UNNAMED -Xshare:off’
14. If the instance group configuration is set to a non-default OpenJDK Java 17 version binary path for the JAVA_HOME parameter, modify the Spark configuration to work around a compatibility issue:
a) Click Workload > Instance Groups and select the instance group that uses the new Spark 3.3.1 package.
b) In the instance group overview, click Manage > Configure.
c) Click the Spark tab, then the Configuration button,
and search for spark.driver.extraJavaOptions
and spark.executor.extraJavaOptions.
d) For the two parameters spark.driver.extraJavaOptions and spark.executor.extraJavaOptions, enter the following value for both:
‘--add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED
--add-opens=java.base/java.io=ALL-UNNAMED
--add-opens=java.base/java.net=ALL-UNNAMED
--add-opens=java.base/java.nio=ALL-UNNAMED
--add-opens=java.base/java.util=ALL-UNNAMED
--add-opens=java.base/java.util.concurrent=ALL-UNNAMED
--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED
--add-opens=java.base/sun.nio.cs=ALL-UNNAMED
--add-opens=java.base/sun.security.action=ALL-UNNAMED
--add-opens=java.base/sun.util.calendar=ALL-UNNAMED
--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED’
e) Click Close.
f) Click the Modify Instance Group button, start the instance group, and then run your application.
15. If required, upgrade your existing Spark instance groups to use the new Spark 3.3.1 package. For details, see https://www.ibm.com/docs/en/spectrum-conductor/2.5.1?topic=groups-using-updated-upgraded-components
4. List of files

sc-2.5.1.0_build601552.tgz:
    Spark3.3.1-Conductor2.5.1.tgz
conductorsparkcore-2.5.1.0_x86_64_build601552.tar.gz:
    integration/elk/activation/conductorsparkcore2.5.1/conf/indexer/cws_spark.conf
conductorsparkcore-2.5.1.0_ppc64le_build601552.tar.gz:
    integration/elk/activation/conductorsparkcore2.5.1/conf/indexer/cws_spark.conf
5. Product notifications

To receive information about product solution and patch updates automatically, subscribe to product notifications on the My Notifications page (http://www.ibm.com/support/mynotifications/) on the IBM Support website (http://support.ibm.com). You can edit your subscription settings to choose the types of information that you want to be notified about, for example, security bulletins, fixes, troubleshooting, and product enhancements or documentation changes.
6. Copyright and trademark information

© Copyright IBM Corporation 2023
U.S. Government Users Restricted
Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract
with IBM Corp.
IBM®, the IBM logo and ibm.com®
are trademarks of International Business Machines Corp., registered in many
jurisdictions worldwide. Other product and service names might be trademarks of
IBM or other companies. A current list of IBM trademarks is available on the
Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml