[Kubernetes] Pod Debugging: Checking Error Information and Fixing the Problem

턴태 2023. 11. 9. 18:45

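I ran into an error while deploying a server application on Kubernetes. After swapping in a newly built image with set image, the pod went into CrashLoopBackOff and its container kept being created and stopped over and over.

I wanted to find the cause first, so I looked into the different ways to approach the problem. This post collects the parts that were helpful or worth writing down.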

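Debugging

When a pod misbehaves, the problem can sit in one of three kinds of objects, so check each of them:

Problems with the pod
Problems with the replication controller
Problems with the service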

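Debugging Pods

First check whether the pod itself has a problem. The simplest way is kubectl describe. You can also run kubectl get first to get a rough picture of the state of your pods and services, and then look at the details.

kubectl describe pods <pod-name>

This prints the state of the resource, as shown below.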

$ kubectl describe pod/simple-api-7574d7f6b4-bfwpl
Name:             simple-api-7574d7f6b4-bfwpl
Namespace:        default
Priority:         0
Service Account:  default
Node:             minikube/192.168.49.2
Start Time:       Thu, 09 Nov 2023 17:05:52 +0900
Labels:           app=simple-api
                  pod-template-hash=7574d7f6b4
Annotations:      <none>
Status:           Running
IP:               10.244.0.165
IPs:
  IP:           10.244.0.165
Controlled By:  ReplicaSet/simple-api-7574d7f6b4
Containers:
  simple-api:
    Container ID:   docker://c48d1ea047c173dc616c986aea5f143e1db89d43ce2b7bfa2e8382c7e5bff074
    Image:          stae1102/kubernetes-server:1.1
    Image ID:       ########
    Port:           3000/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 09 Nov 2023 17:27:16 +0900
      Finished:     Thu, 09 Nov 2023 17:27:16 +0900
    Ready:          False
    Restart Count:  9
    Environment Variables from:
      nodejs-config  Secret  Optional: false
    Environment:     <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4r6sm (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-4r6sm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  21m                 default-scheduler  Successfully assigned default/simple-api-7574d7f6b4-bfwpl to minikube
  Normal   Pulled     20m (x5 over 21m)   kubelet            Container image "stae1102/kubernetes-server:1.1" already present on machine
  Normal   Created    20m (x5 over 21m)   kubelet            Created container simple-api
  Normal   Started    20m (x5 over 21m)   kubelet            Started container simple-api
  Warning  BackOff    97s (x95 over 21m)  kubelet            Back-off restarting failed container simple-api in pod simple-api-7574d7f6b4-bfwpl_default(52a0234d-e2c9-4bc0-b66e-b4c82805669c)

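The fields to focus on here are Containers.State.Reason, Containers.Last State, Containers.Ready, Conditions, and Events.

The first thing to look at is the state. The official documentation offers guidance for four situations. Note that Pending is a pod phase (shown under Status), while Waiting is a container state (shown under Containers.State).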
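The same fields can be pulled out with a JSONPath query instead of reading the whole describe output. This is a minimal sketch, assuming a POSIX shell with kubectl on the PATH; the function names pod_container_summary and pod_events are hypothetical helpers, not part of kubectl.

```shell
# pod_container_summary POD [NAMESPACE]
# Prints one tab-separated line per container:
#   name, waiting reason, last exit code, restart count.
# Fields that are not set (for example, no waiting reason) print as empty columns.
pod_container_summary() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state.waiting.reason}{"\t"}{.lastState.terminated.exitCode}{"\t"}{.restartCount}{"\n"}{end}'
}

# pod_events POD [NAMESPACE]
# Lists the events recorded for the pod, sorted by last timestamp.
pod_events() {
  kubectl get events -n "${2:-default}" \
    --field-selector "involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}
```

For the pod in this post, the call would be pod_container_summary simple-api-7574d7f6b4-bfwpl.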

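1️⃣ Pending

A pod stuck in Pending cannot be scheduled onto a node. This usually happens because some type of resource is insufficient or because something else is preventing scheduling.

There are two common causes.

Insufficient resources: The cluster may have run out of CPU or memory. In that case, delete pods, adjust resource requests, or add nodes to the cluster. See the compute resources documentation for more information.
Using hostPort: Binding a pod to a hostPort limits the number of places where the pod can be scheduled. In most cases hostPort is unnecessary, so consider exposing the pod through a Service object instead. If you do need hostPort, you can schedule only as many pods as there are nodes in the cluster.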
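To see which pods are stuck and why the scheduler rejected them, the pod phase and the FailedScheduling events can be queried directly. A minimal sketch, assuming kubectl is on the PATH; list_pending_pods and scheduling_failures are hypothetical helper names.

```shell
# list_pending_pods [NAMESPACE]
# Lists pods whose phase is Pending.
list_pending_pods() {
  kubectl get pods -n "${1:-default}" --field-selector=status.phase=Pending
}

# scheduling_failures POD [NAMESPACE]
# Shows FailedScheduling events for the pod. The event message usually
# names the reason the scheduler could not place it (for example, a
# shortage of cpu or memory, or a port conflict).
scheduling_failures() {
  kubectl get events -n "${2:-default}" \
    --field-selector "involvedObject.name=$1,reason=FailedScheduling"
}
```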

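To adjust resources when defining a pod, state the compute resources to allocate directly in the object's manifest, under the resources key.

The meaning of the resource units is described at the following link. https://kubernetes.io/ko/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu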

resources:
  requests:
    memory: "64Mi"
    cpu: "250m"
  limits:
    memory: "128Mi"
    cpu: "500m"
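The units above are easier to reason about once converted: the m suffix means thousandths of a CPU core, and Mi means mebibytes (1024 * 1024 bytes). The following sketch converts a few quantities in plain POSIX shell. The function names are hypothetical, and only the m suffix and the binary suffixes Ki, Mi, and Gi are handled; other suffixes that Kubernetes accepts are deliberately left out.

```shell
# cpu_to_millicores QUANTITY
# Converts a CPU quantity such as "250m" or "0.5" to millicores.
cpu_to_millicores() {
  case "$1" in
    *m) printf '%s\n' "${1%m}" ;;
    # Add 0.5 before truncating so floating-point noise does not round down.
    *)  awk -v v="$1" 'BEGIN { printf "%d\n", v * 1000 + 0.5 }' ;;
  esac
}

# mem_to_bytes QUANTITY
# Converts a memory quantity with a Ki, Mi, or Gi suffix to bytes.
# A value without one of these suffixes is printed unchanged.
mem_to_bytes() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *)   echo "$1" ;;
  esac
}

cpu_to_millicores 250m   # prints 250
cpu_to_millicores 0.5    # prints 500
mem_to_bytes 64Mi        # prints 67108864
```

So the request above asks for a quarter of a core and 64 MiB, and the limit allows up to half a core and 128 MiB.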

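To decide how much to allocate, look at the metrics Kubernetes provides. According to the documentation, the kubelet reports a pod's resource usage as part of the pod status. If monitoring is available in the cluster, read the usage from the Metrics API or from a monitoring tool such as Prometheus, and size the requests and limits accordingly.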
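One quick way to read current usage is kubectl top, which queries the Metrics API. A minimal sketch follows; pod_usage is a hypothetical helper name, and the command only works if a metrics provider such as metrics-server is installed in the cluster (an assumption about your setup, not something the original deployment is known to have).

```shell
# pod_usage POD [NAMESPACE]
# Prints current CPU and memory usage per container of the pod.
# Assumption: metrics-server (or another Metrics API provider) is running.
# On minikube it is usually enabled with: minikube addons enable metrics-server
pod_usage() {
  kubectl top pod "$1" -n "${2:-default}" --containers
}
```

Comparing these numbers under realistic load with the requests and limits in the manifest gives a starting point for tuning them.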

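2️⃣ Waiting

This was my case. The pod kept restarting in CrashLoopBackOff, which is why the container state was Waiting.

A container can be stuck in Waiting for several reasons, but the most common one is a failure to pull the image. So check that the image name is specified correctly and that the image actually exists in the registry you pushed it to.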
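The waiting reason already narrows down where to look. The sketch below reads the reason of the first container and maps a few commonly seen reasons to a next step. The function names and the hint texts are my own illustration, not an official mapping, and the list of reasons is not exhaustive.

```shell
# waiting_reason POD [NAMESPACE]
# Prints the waiting reason of the pod's first container (empty if none).
waiting_reason() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}'
}

# waiting_hint REASON
# Maps a waiting reason to a suggested next step (illustrative only).
waiting_hint() {
  case "$1" in
    ErrImagePull|ImagePullBackOff)
      echo "check the image name, tag, and registry credentials" ;;
    CrashLoopBackOff)
      echo "the container starts and then exits; check its logs and exit code" ;;
    CreateContainerConfigError)
      echo "check the ConfigMaps and Secrets the container references" ;;
    "")
      echo "no waiting reason reported" ;;
    *)
      echo "see the pod events for details" ;;
  esac
}
```

A typical call chains the two: waiting_hint "$(waiting_reason simple-api-7574d7f6b4-bfwpl)".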

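The Exit Code under Last State gives a rough idea of the context in which the error occurred. I did not find a description of every exit code in the official Docker documentation, so I referred to a blog post from a company that provides Kubernetes-related services (the Komodor guide in the references). The codes follow the same conventions as shell script exit codes, so they look familiar.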

CODE #	NAME	WHAT IT MEANS
Exit Code 0	Purposely stopped	Used by developers to indicate that the container was automatically stopped
Exit Code 1	Application error	Container was stopped due to application error or incorrect reference in the image specification
Exit Code 125	Container failed to run error	The docker run command did not execute successfully
Exit Code 126	Command invoke error	A command specified in the image specification could not be invoked
Exit Code 127	File or directory not found	File or directory specified in the image specification was not found
Exit Code 128	Invalid argument used on exit	Exit was triggered with an invalid exit code (valid codes are integers between 0-255)
Exit Code 134	Abnormal termination (SIGABRT)	The container aborted itself using the abort() function.
Exit Code 137	Immediate termination (SIGKILL)	Container was immediately terminated by the operating system via SIGKILL signal
Exit Code 139	Segmentation fault (SIGSEGV)	Container attempted to access memory that was not assigned to it and was terminated
Exit Code 143	Graceful termination (SIGTERM)	Container received warning that it was about to be terminated, then terminated
Exit Code 255	Exit Status Out Of Range	Container exited, returning an exit code outside the acceptable range, meaning the cause of the error is not known
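The codes above 128 in the table follow the shell convention of 128 plus the signal number, which is why 137 maps to SIGKILL (9) and 143 to SIGTERM (15). A minimal sketch that decodes a code this way is below; explain_exit_code is a hypothetical helper, and it only names the four signals that appear in the table.

```shell
# explain_exit_code CODE
# Decodes a container exit code using the 128 + signal-number convention.
explain_exit_code() {
  code="$1"
  if [ "$code" -gt 128 ] && [ "$code" -lt 255 ]; then
    sig=$(( code - 128 ))
    case "$sig" in
      6)  name=SIGABRT ;;
      9)  name=SIGKILL ;;
      11) name=SIGSEGV ;;
      15) name=SIGTERM ;;
      *)  name=unknown ;;
    esac
    echo "terminated by signal $sig ($name)"
  elif [ "$code" -eq 0 ]; then
    echo "exited normally"
  else
    echo "exited with error code $code"
  fi
}

explain_exit_code 137   # prints: terminated by signal 9 (SIGKILL)
explain_exit_code 1     # prints: exited with error code 1
```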

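Exit Code 1 indicates an application error, so I decided to check the logs. It typically comes from an unhandled error in the application itself, such as dividing a number by zero or an invalid reference.

Viewing a container's logs in Kubernetes is simple and works much like it does in Docker.

kubectl logs <resource-name>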
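With CrashLoopBackOff, the container may have been restarted by the time you ask for logs, so the output of the previous instance is often the one that contains the crash. kubectl logs supports this with --previous. A minimal sketch, where crash_logs is a hypothetical helper name and the tail length of 50 is an arbitrary choice:

```shell
# crash_logs POD [CONTAINER] [NAMESPACE]
# Prints the last 50 log lines of the previous (terminated) container instance.
# CONTAINER is only needed when the pod runs more than one container.
crash_logs() {
  if [ -n "$2" ]; then
    kubectl logs "$1" -n "${3:-default}" -c "$2" --previous --tail=50
  else
    kubectl logs "$1" -n "${3:-default}" --previous --tail=50
  fi
}
```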

The pod's logs showed the following.

$ kubectl logs simple-api-7574d7f6b4-bfwpl
yarn run v1.22.19
$ node dist/main
node:internal/modules/cjs/loader:1051
  throw err;
  ^

Error: Cannot find module '/home/server/dist/main'
    at Module._resolveFilename (node:internal/modules/cjs/loader:1048:15)
    at Module._load (node:internal/modules/cjs/loader:901:27)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:83:12)
    at node:internal/main/run_main_module:23:47 {
  code: 'MODULE_NOT_FOUND',
  requireStack: []
}

Node.js v20.9.0
error Command failed with exit code 1.

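The cause turned out to be simple: the script behind the yarn prod command pointed at the wrong entry file, so Node.js could not find the module. The application used to live only under src, but I had created an infra folder and added references to it, which changed the directory layout of the build output.

If you want to go back to the previous version, it is best to deploy the pods through a Deployment and use rollout to return to an earlier revision.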
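A rollback with kubectl rollout could look like the sketch below. It assumes the pods are managed by a Deployment; rollback_deployment is a hypothetical helper name, and the revision number is whatever kubectl rollout history reports for your Deployment.

```shell
# rollback_deployment NAME [REVISION]
# Shows the revision history, rolls back (to REVISION if given, otherwise
# to the previous revision), then waits for the rollout to finish.
rollback_deployment() {
  kubectl rollout history "deployment/$1" || return 1
  if [ -n "$2" ]; then
    kubectl rollout undo "deployment/$1" --to-revision="$2" || return 1
  else
    kubectl rollout undo "deployment/$1" || return 1
  fi
  kubectl rollout status "deployment/$1"
}
```

For the Deployment in this post the call would be rollback_deployment simple-api.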

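3️⃣ Crashing or Unhealthy

Once the pod has been scheduled and started, the troubleshooting steps at https://kubernetes.io/ko/docs/tasks/debug/debug-application/debug-running-pod/ are a good place to look.

Checking the logs as described above works here too, and connecting to the container directly is also a good option. To open a shell inside the container, run the following command.

kubectl exec -it <pod-name> -- /bin/bash

The general usage is as follows.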

kubectl exec ${POD_NAME} -c ${CONTAINER_NAME} -- ${CMD} ${ARG1} ${ARG2} ... ${ARGN}

For example, to view the Cassandra logs, run the following command.

kubectl exec cassandra -- cat /var/log/cassandra/system.log

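Alternatively, you can start a debug container with kubectl debug. The first argument is the name of the pod to attach to.

kubectl debug -it <pod-name> --image=<image> --target=<target-container>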
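When the container exits immediately, as in CrashLoopBackOff, kubectl exec has nothing to attach to. One option described in the Kubernetes documentation is to debug a copy of the pod with the container's command replaced by a shell. The sketch below wraps that; debug_copy is a hypothetical helper name, the -debug suffix is an arbitrary naming choice, and it assumes the image contains sh.

```shell
# debug_copy POD CONTAINER
# Creates a copy of POD named POD-debug in which CONTAINER runs an
# interactive shell instead of its normal command.
# Delete the copy afterwards with: kubectl delete pod POD-debug
debug_copy() {
  kubectl debug "$1" -it --copy-to="$1-debug" --container="$2" -- sh
}
```

From that shell you can inspect the filesystem, for example to check whether the file the start script expects actually exists in the image.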

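4️⃣ Running

Sometimes a pod is Running but still does not behave as expected. This is the scariest case...

I once installed postgresql with helm, but the values I wanted were not overridden, so the custom user was never created. This kind of problem often comes from a mistake in the pod spec. You can check whether the spec is valid before deploying it.

kubectl apply --validate -f <manifest-file>

Suppose you deliberately misspell the command key as commadn and try to deploy. Validation catches the typo, and I got the following result.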

$ kubectl apply -f api-k8s.yml --validate
The request is invalid: patch: Invalid value: "map[metadata:map[annotations:map[kubectl.kubernetes.io/last-applied-configuration:{\"apiVersion\":\"apps/v1\",\"kind\":\"Deployment\",\"metadata\":{\"annotations\":{},\"name\":\"simple-api\",\"namespace\":\"default\"},\"spec\":{\"replicas\":3,\"selector\":{\"matchLabels\":{\"app\":\"simple-api\"}},\"template\":{\"metadata\":{\"labels\":{\"app\":\"simple-api\"}},\"spec\":{\"containers\":[{\"commadn\":[\"echo 'test'\"],\"envFrom\":[{\"secretRef\":{\"name\":\"nodejs-config\"}}],\"image\":\"stae1102/kubernetes-server:1.1\",\"name\":\"simple-api\",\"ports\":[{\"containerPort\":3000}]}]}}}}\n]] spec:map[template:map[spec:map[]]]]": strict decoding error: unknown field "spec.template.spec.containers[0].commadn"
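To run the same check without changing anything in the cluster, validation can be combined with a server-side dry run. A minimal sketch, assuming a reasonably recent kubectl that accepts --validate=strict; check_manifest is a hypothetical helper name.

```shell
# check_manifest FILE
# Sends the manifest to the apiserver for validation without persisting it.
# Unknown or duplicated fields are reported as errors in strict mode.
check_manifest() {
  kubectl apply --dry-run=server --validate=strict -f "$1"
}
```

A non-zero exit status means the manifest was rejected, which makes the function usable as a gate in a deploy script, for example check_manifest api-k8s.yml && kubectl apply -f api-k8s.yml.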

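It is also worth inspecting the resource's spec as stored by the apiserver. For the postgresql problem, rereading the spec is how I confirmed that the override had not been applied.

kubectl get <resource-type>/<resource-name> -o yaml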
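Instead of comparing the live YAML with the manifest by eye, kubectl diff can do the comparison. Its exit status is 0 when nothing differs, 1 when differences were found, and greater than 1 on error. The sketch below turns that into a short message; manifest_drift is a hypothetical helper name.

```shell
# manifest_drift FILE
# Compares the live objects with the manifest and summarizes the result.
manifest_drift() {
  status=0
  kubectl diff -f "$1" || status=$?
  case "$status" in
    0) echo "live objects match $1" ;;
    1) echo "live objects differ from $1" ;;
    *) echo "kubectl diff failed with status $status" >&2
       return "$status" ;;
  esac
}
```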

Debugging Services

This is not directly about pods, but services need debugging too.

First, check that endpoints exist for the service. For every Service object, the apiserver makes an endpoints resource available.

kubectl get endpoints <service-name>

NAME           ENDPOINTS           AGE
guestbook-ui   10.244.0.151:80     4d
kubernetes     192.168.49.2:8443   9d
simple-api                         92m

Make sure the number of endpoints matches the number of pods that belong to the service.
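That comparison can be scripted by counting the endpoint addresses and the pods that match the service's label selector. A minimal sketch with hypothetical helper names; it counts IP addresses across all subsets, so treat the number as approximate if the endpoints object has more than one subset. Only ready pods are listed as endpoint addresses, so a pod that fails its readiness check also shows up as a mismatch.

```shell
# endpoint_count SERVICE [NAMESPACE]
# Counts the ready endpoint addresses behind the service.
endpoint_count() {
  kubectl get endpoints "$1" -n "${2:-default}" \
    -o jsonpath='{.subsets[*].addresses[*].ip}' | wc -w | tr -d ' '
}

# pod_count SELECTOR [NAMESPACE]
# Counts pods that match a label selector such as "app=simple-api".
pod_count() {
  kubectl get pods -n "${2:-default}" -l "$1" -o name | wc -l | tr -d ' '
}
```

In the output above, simple-api has an empty ENDPOINTS column, so endpoint_count simple-api would print 0 even though the Deployment asks for three replicas.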

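✔️ When the service has no endpoints

If the endpoints are empty, try listing pods with the labels the service selects on. Suppose the service has the following selector.

...
spec:
  selector:
    name: nginx
    type: frontend

List the pods that match that selector:

kubectl get pods --selector=name=nginx,type=frontend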

If the list is empty or is missing pods you expected, the selector is most likely wrong. A wrongly specified port is another common cause.
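Typing the selector by hand invites the same typo you are trying to find, so it can be read from the service itself. The sketch below builds the selector string with a Go template and then lists the matching pods. The function names are hypothetical, and it assumes the service uses a plain label selector.

```shell
# service_selector SERVICE [NAMESPACE]
# Prints the service's selector as key=value,key=value.
service_selector() {
  kubectl get service "$1" -n "${2:-default}" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}' \
    | sed 's/,$//'
}

# pods_for_service SERVICE [NAMESPACE]
# Lists the pods that the service's selector matches.
pods_for_service() {
  sel="$(service_selector "$1" "$2")"
  if [ -z "$sel" ]; then
    echo "service $1 has no selector" >&2
    return 1
  fi
  kubectl get pods -n "${2:-default}" -l "$sel" -o wide
}
```

If pods_for_service lists the expected pods but the endpoints are still empty, compare the service's targetPort with the containerPort the pods expose.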

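Wrapping Up

Deploying always scares me. The phrase "but it worked on my machine" never leaves my head. The habit of checking events and logs helps every time. This time too, reading the logs let me find and fix the problem quickly. Keep reading the logs!

References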

https://kubernetes.io/ko/docs/tasks/debug/debug-application/debug-pods/#debugging-replication-controllers
Debugging Pods (kubernetes.io)

https://kubernetes.io/ko/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu
Resource Management for Pods and Containers (kubernetes.io)

https://kubernetes.io/docs/tasks/debug/debug-application/debug-init-containers/
Debug Init Containers (kubernetes.io)

https://discuss.kubernetes.io/t/etcd-and-kube-apiserver-pods-in-crashloopbackoff-state-after-node-reboot/22152
Etcd and kube-apiserver pods in CrashLoopBackOff state after node reboot (discuss.kubernetes.io)

https://docs.docker.com/engine/reference/run/#exit-status
Docker run reference (docs.docker.com)

https://komodor.com/learn/exit-codes-in-containers-and-kubernetes-the-complete-guide/
Exit Codes in Containers & Kubernetes | Complete Guide | Komodor (komodor.com)
