We need to run Spark jobs in a Kubernetes cluster with Spark Operator. The official image contains many vulnerabilities, including ones introduced by the bundled Hadoop libraries, so let's build our own Spark Operator image.
To build our image, we need a Spark image as the base image and a Golang image to compile Spark Operator itself.
Building a Spark image without Hadoop using a specific version of Spark
RUN curl -L https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-without-hadoop.tgz -o spark-3.5.1-bin-without-hadoop.tgz \
&& tar -xvzf spark-3.5.1-bin-without-hadoop.tgz \
&& mv spark-3.5.1-bin-without-hadoop /opt/spark \
&& rm spark-3.5.1-bin-without-hadoop.tgz
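The version and package flavor in the RUN instruction above can be parameterized. The following is a minimal sketch, not part of the original build: the helper names `spark_dist_url` and `install_spark` are hypothetical, and the base URL is simply copied from the command above.

```shell
# Hypothetical helper: compose the download URL for a Spark distribution.
# $1 = Spark version, $2 = package flavor (e.g. "without-hadoop").
spark_dist_url() {
    version="$1"
    flavor="$2"
    # Base URL copied from the RUN instruction above.
    echo "https://dlcdn.apache.org/spark/spark-${version}/spark-${version}-bin-${flavor}.tgz"
}

# Hypothetical helper: download the archive and unpack it into a target directory.
# It is only defined here, not called, because it needs network access.
install_spark() {
    url="$(spark_dist_url "$1" "$2")"
    target="$3"
    mkdir -p "${target}"
    # -f makes curl fail on HTTP errors instead of passing an error page to tar.
    curl -fL "${url}" | tar -xz --strip-components 1 -C "${target}"
}

# Print the URL for the distribution used in this article.
spark_dist_url 3.5.1 without-hadoop
```

Calling `install_spark 3.5.1 without-hadoop /opt/spark` would then unpack the distribution straight into /opt/spark without leaving the archive on disk.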
When we build the Spark Operator image, we need several Hadoop libraries to run spark-submit commands.
The example below is a FIPS build, which differs from a regular build in its build and run commands.
For the Go build, the parameter GOEXPERIMENT=boringcrypto is used.
For running spark-submit, the Java parameter -Djavax.net.ssl.trustStorePassword=password is used for Bouncy Castle.
You can also build the image without the FIPS changes.
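To make the two FIPS-specific differences easier to see, here is a small sketch that switches them on and off. The function names `go_build_env` and `spark_submit_opts` are hypothetical, and the sketch covers only the two parameters named above, not every setting a FIPS build may need.

```shell
# Hypothetical helper: print the Go build environment for the operator binary.
# $1 = "true" for the FIPS variant, anything else for a regular build.
go_build_env() {
    env_line="GOOS=linux GO111MODULE=on"
    if [ "$1" = "true" ]; then
        # The FIPS variant adds the BoringCrypto experiment.
        env_line="${env_line} GOEXPERIMENT=boringcrypto"
    fi
    echo "${env_line}"
}

# Hypothetical helper: print the extra JVM options for spark-submit.
# Prints nothing for a regular (non-FIPS) build.
spark_submit_opts() {
    if [ "$1" = "true" ]; then
        echo "-Djavax.net.ssl.trustStorePassword=password"
    fi
}

go_build_env true
spark_submit_opts true
```

The output of `go_build_env` could be passed to the compiler with something like `env $(go_build_env true) go build ...`, and the output of `spark_submit_opts` could be appended to SPARK_SUBMIT_OPTS.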
To run spark-submit, we will add Hadoop libraries during the build process:
hadoop-client-runtime
hadoop-client-api
slf4j-api
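A quick way to confirm that these three libraries actually ended up in the image is to look for them in the Spark jars directory. The sketch below assumes the jars are named with the listed prefix followed by a version; the function name `missing_hadoop_jars` and the file names in the demo are placeholders.

```shell
# Hypothetical helper: report which of the required jars are missing from a directory.
# $1 = jars directory (e.g. /opt/spark/jars). Prints one missing prefix per line.
missing_hadoop_jars() {
    jars_dir="$1"
    for prefix in hadoop-client-runtime hadoop-client-api slf4j-api; do
        found=no
        for jar in "${jars_dir}/${prefix}"-*.jar; do
            # An unmatched glob stays literal, so check that the file really exists.
            if [ -f "${jar}" ]; then
                found=yes
                break
            fi
        done
        if [ "${found}" = no ]; then
            echo "${prefix}"
        fi
    done
}

# Demo with a temporary directory that holds only two of the three jars.
demo_dir="$(mktemp -d)"
touch "${demo_dir}/hadoop-client-api-3.4.0.jar" "${demo_dir}/slf4j-api-1.7.36.jar"
missing_hadoop_jars "${demo_dir}"
rm -rf "${demo_dir}"
```

In the demo the function reports hadoop-client-runtime as missing. Inside the built image, `missing_hadoop_jars /opt/spark/jars` should print nothing.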
entrypoint.sh is taken from the official Kubeflow repository: https://github.com/kubeflow/spark-operator/blob/master/entrypoint.sh
Example Dockerfile for building Spark Operator
ARG SPARK_IMAGE=spark-3.5.1-bin-without-hadoop
ARG GOLANG_IMAGE=golang-1.21
ARG SPARK_OPERATOR_VERSION=1.3.1
ARG HADOOP_VERSION_DEFAULT=3.4.0
ARG HADOOP_TMP_HOME="/opt/hadoop"
ARG TARGETARCH=amd64
# Registry/repository of the base Spark image; pass it with --build-arg ECR_URL=...
ARG ECR_URL
# Prepare spark-operator build
FROM ${GOLANG_IMAGE} AS builder
WORKDIR /app/spark-operator
ARG SPARK_OPERATOR_VERSION
RUN curl -Ls https://github.com/kubeflow/spark-operator/archive/refs/tags/spark-operator-chart-${SPARK_OPERATOR_VERSION}.tar.gz | tar -xz --strip-components 1 -C /app/spark-operator
RUN GOTOOLCHAIN=go1.22.3 go mod download
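# Build
# Note: GOEXPERIMENT=boringcrypto only takes effect when cgo is enabled. With CGO_ENABLED=0
# Go falls back to its standard crypto, so a real FIPS build needs CGO_ENABLED=1 and a C toolchain.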
ARG TARGETARCH
RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} GO111MODULE=on GOTOOLCHAIN=go1.22.3 GOEXPERIMENT=boringcrypto go build -a -o /app/spark-operator/spark-operator main.go
# Install Hadoop jars
ARG HADOOP_VERSION_DEFAULT
ARG HADOOP_TMP_HOME
RUN mkdir -p ${HADOOP_TMP_HOME}
RUN curl -Ls https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION_DEFAULT}/hadoop-${HADOOP_VERSION_DEFAULT}.tar.gz | tar -xz --strip-components 1 -C ${HADOOP_TMP_HOME}
# Prepare spark-operator image
FROM ${ECR_URL}:${SPARK_IMAGE}
WORKDIR /opt/spark-operator
USER root
ENV SPARK_HOME="/opt/spark"
ENV JAVA_HOME="/opt/jdk-11.0.21"
ENV SPARK_SUBMIT_OPTS="${SPARK_SUBMIT_OPTS} -Djavax.net.ssl.trustStorePassword=password"
ENV PATH="${JAVA_HOME}/bin:${PATH}:${SPARK_HOME}/bin"
RUN yum update -y && \
yum install --setopt=tsflags=nodocs -y openssl && \
yum clean all
ARG HADOOP_TMP_HOME
COPY --from=builder ${HADOOP_TMP_HOME}/share/hadoop/client/hadoop-client-runtime-*.jar ${HADOOP_TMP_HOME}/share/hadoop/client/hadoop-client-api-*.jar ${HADOOP_TMP_HOME}/share/hadoop/common/lib/slf4j-api-*.jar /opt/spark/jars/
COPY --from=builder /app/spark-operator/spark-operator /opt/spark-operator/
COPY --from=builder /app/spark-operator/hack/gencerts.sh /usr/bin/
COPY entrypoint.sh /opt/spark-operator/
RUN chmod a+x /opt/spark-operator/entrypoint.sh
ENTRYPOINT ["/opt/spark-operator/entrypoint.sh"]
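The Dockerfile above can be built with a command along these lines. This is a sketch under assumptions: the function name `operator_build_cmd`, the registry `registry.example.com/spark`, and the image tag are placeholders, and ECR_URL has to be supplied because the final FROM line references it.

```shell
# Hypothetical helper: print a docker build command for the Dockerfile above.
# $1 = registry/repository passed as ECR_URL, $2 = name:tag for the resulting image.
operator_build_cmd() {
    echo "docker build" \
        "--build-arg ECR_URL=$1" \
        "--build-arg SPARK_OPERATOR_VERSION=1.3.1" \
        "--build-arg HADOOP_VERSION_DEFAULT=3.4.0" \
        "-t $2 ."
}

# Placeholder registry and tag; replace them with your own values.
operator_build_cmd registry.example.com/spark spark-operator:1.3.1
```

Printing the command first makes it easy to review the build arguments before running it from the directory that contains the Dockerfile and entrypoint.sh.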
After the build, a few vulnerabilities remain in the Hadoop library hadoop-client-runtime.
We can't drop this library, because spark-submit won't run without it, but the vast majority of the vulnerabilities are gone along with the main Hadoop libraries.