Deploying Spark on Kubernetes

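Apache Spark is a distributed data processing platform widely used for large-scale data analysis.
This post walks through deploying Spark on a Kubernetes cluster, step by step.
Combining Kubernetes and Spark lets you deploy and manage Spark applications flexibly in a container-based environment.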

To be honest, I got it deployed but don't really know how to use it yet... I'll just have to figure it out one step at a time, haha.

Spark and Kubernetes

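Apache Spark is a distributed processing platform for data analysis, streaming, and machine learning workloads.
Kubernetes is a container orchestration tool that makes it easy to deploy and manage a wide range of applications.
Deploying Spark on Kubernetes brings the following benefits.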

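Flexible deployment: applications run in the same environment across different infrastructure.

Autoscaling: the Spark cluster can be scaled dynamically as needed.

Resource isolation: jobs are separated at the container level to prevent conflicts.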

Installing Spark

spark-operator

I plan to keep the deployment simple by using spark-operator.

https://github.com/kubeflow/spark-operator

https://kubeflow.github.io/spark-operator

https://kubeflow.github.io/spark-operator/docs/quick-start-guide.html

I follow the first link for the installation.
The second and third links are reference material.

Downloading the Helm Chart

helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm repo update

List the charts that are available to install.

helm search repo spark-operator

The spark-operator repository contains only one chart, so there is nothing to choose between.

Downloading the source

I pull the chart source so that I can also look at how the chart is put together.

# Download helm chart
helm pull spark-operator/spark-operator --untar
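To find the web UI settings before editing them, something like the following could be used. It is only a minimal sketch: it assumes the pull command above unpacked the chart into ./spark-operator, and show_ui_ingress is a made-up helper name, not part of Helm or the chart.

```shell
# Hypothetical helper: print the uiIngress block from the unpacked chart's values.yaml.
# Assumes `helm pull ... --untar` created ./spark-operator in the current directory.
show_ui_ingress() {
  values_file="${1:-spark-operator/values.yaml}"
  # -n shows line numbers, -A 6 prints the six lines that follow the match
  grep -n -A 6 'uiIngress:' "$values_file"
}
```

Calling show_ui_ingress with no argument reads spark-operator/values.yaml; pass a different path as the first argument if the chart was unpacked somewhere else.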

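Chart configuration

In values.yaml, configure access to the Spark web UI.
Under controller.uiIngress, change enable to true and set urlFormat to your domain.
Additional Ingress setup is needed after deployment; it is covered in the ingress-nginx section later in this post.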

controller:
  uiIngress:
    # -- Specifies whether to create ingress for Spark web UI.
    # `controller.uiService.enable` must be `true` to enable ingress.
    enable: true # changed from false
    # -- Ingress URL format.
    # Required if `controller.uiIngress.enable` is true.
    urlFormat: "spark.icurfer.dev"

Deploying the chart

Deploy the chart with these settings applied.
Change the values.yaml path to match the location of your own file.

helm install spark-operator spark-operator/spark-operator --namespace spark-operator --create-namespace --wait -f /config/workspace/opensource-test/spark/onK8s/spark-operator/values.yaml

Applying changes later (for reference)

helm upgrade --install spark-operator spark-operator/spark-operator --namespace spark-operator -f /config/workspace/opensource-test/spark/onK8s/spark-operator/values.yaml

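Once the release is installed, you can check the deployment status.
(The original screenshot came from Rancher's monitoring view, which looks different from plain Kubernetes CLI output.)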
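Without Rancher, roughly the same information is available from the CLI. The sketch below wraps standard helm and kubectl commands in a helper; the name check_spark_operator is made up for illustration, and the release name and namespace are the ones from the install command in this post.

```shell
# Hypothetical helper wrapping standard helm/kubectl status commands.
# Release name and namespace match the install command used in this post.
check_spark_operator() {
  release="spark-operator"
  namespace="spark-operator"
  # Release status as recorded by Helm
  helm status "$release" --namespace "$namespace"
  # User-supplied values (should include the uiIngress settings)
  helm get values "$release" --namespace "$namespace"
  # Operator pods; they should eventually reach the Running state
  kubectl get pods --namespace "$namespace"
}
```

Run check_spark_operator after the install or upgrade finishes.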

Deploying ingress-nginx

My Spark installation shares a Kubernetes environment with Kubeflow.
To avoid conflicts with Kubeflow's Istio, I deploy a separate ingress-nginx.

https://kubernetes.github.io/ingress-nginx/deploy/#bare-metal-clusters

My personal test environment runs on-premises.
I prefer HAProxy with NodePort over MetalLB, so I deploy ingress-nginx as a NodePort service.
(Further details about ingress-nginx are omitted here.)
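This post does not show the exact install command. One possible way, assuming Helm and the chart from the ingress-nginx project's own repository, is sketched here. The release name, namespace, and function name are my assumptions, not taken from the original setup; controller.service.type is the chart value that selects the Service type.

```shell
# Sketch only: install ingress-nginx with a NodePort Service via its Helm chart.
# Release name "ingress-nginx" and namespace "ingress-nginx" are assumptions.
install_ingress_nginx_nodeport() {
  helm upgrade --install ingress-nginx ingress-nginx \
    --repo https://kubernetes.github.io/ingress-nginx \
    --namespace ingress-nginx --create-namespace \
    --set controller.service.type=NodePort
}
```

After that, kubectl get service --namespace ingress-nginx lists the node ports that an external HAProxy would forward traffic to.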

Once it is deployed, the nginx IngressClass has to be set as the default.

kubectl annotate ingressclass nginx ingressclass.kubernetes.io/is-default-class="true"

After running the command, the nginx IngressClass is marked as the default.
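To confirm the annotation from the CLI, a check along these lines should work. It is a minimal sketch; show_default_ingressclass is a made-up helper name, and it assumes the IngressClass is called nginx, as in the annotate command.

```shell
# Hypothetical helper: print the is-default-class annotation of the nginx IngressClass.
# Dots inside the annotation key are escaped for kubectl's jsonpath syntax.
show_default_ingressclass() {
  kubectl get ingressclass nginx \
    -o jsonpath='{.metadata.annotations.ingressclass\.kubernetes\.io/is-default-class}'
}
```

If the annotation was applied, the command should print the value that was set, which is true here.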

Testing the spark-pi sample app

Create a file named spark-pi.yaml with the following content.

#
# Copyright 2017 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: spark:3.5.3
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples.jar
  arguments:
  - "5000" # increase this value to make the job run longer
  sparkVersion: 3.5.3
  driver:
    labels:
      version: 3.5.3
    cores: 1
    memory: 512m
    serviceAccount: spark-operator-spark
  executor:
    labels:
      version: 3.5.3
    instances: 1
    cores: 1
    memory: 512m

Deploy it.

kubectl apply -f spark-pi.yaml

The web UI can be viewed only while the Spark application is running.
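The run can also be followed from the CLI. The sketch below is a made-up helper, check_spark_pi, built from standard kubectl commands; it assumes the operator names the driver pod spark-pi-driver (application name plus -driver) and that the app runs in the default namespace, as in the manifest.

```shell
# Hypothetical helper: inspect the spark-pi run from the CLI.
# Assumes the driver pod is named <app-name>-driver, i.e. spark-pi-driver.
check_spark_pi() {
  namespace="default"
  # Application state reported by the operator
  kubectl get sparkapplication spark-pi --namespace "$namespace"
  # Ingress created by the operator for the web UI, if uiIngress is enabled
  kubectl get ingress --namespace "$namespace"
  # SparkPi prints a line starting with "Pi is roughly" to the driver log
  kubectl logs spark-pi-driver --namespace "$namespace" | grep "Pi is roughly"
}
```

The last command prints nothing until the job has finished computing, so it is worth running check_spark_pi again once the application state changes to completed.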

That's it!