Interview Question Notes 12: Common Pod Creation Errors in K8s
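- 1. Image pull failures
- 1.1 ErrImagePull (image pull error)
- 1.2 ImagePullBackOff (image pull back-off)
- 1.3 Reproducing the failure
- 1.4 Solution
- 1.5 Confirming recovery
- 2. Pending
- 2.1 Image pull failure
- 2.2 Insufficient resources (CPU, memory)
- 2.2.1 Reproducing the failure
- 2.2.2 Fixing the failure
- 2.3 Insufficient resources (storage)
- 2.3.1 Reproducing the failure
- 2.3.2 Fixing the failure
- 2.4 Label selectors and affinity
- 2.4.1 Reproducing the failure
- 2.4.2 Fixing the failure
- 3. Supplement: common Pod states and their causes
- 3.1 ContainerCreating
- 3.2 ErrImagePull (image pull error)
- 3.3 ImagePullBackOff (image pull back-off)
- 3.4 CrashLoopBackOff (crash loop back-off)
- 3.5 Running - Ready
- 3.6 Terminating
- 3.7 Pending - ImagePullBackOff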
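In Kubernetes, the Pod is the core resource object, and keeping it running stably is essential. A Pod can nevertheless end up in various error states that keep it from running normally. Below are some common errors and how to resolve them:
1. Image pull failures
This failure usually shows up as ErrImagePull or ImagePullBackOff.
1.1 ErrImagePull (image pull error)
Kubernetes cannot pull the container image from the image registry.
Possible causes include a wrong image name, a nonexistent image, failed authentication, and network problems.
1.2 ImagePullBackOff (image pull back-off)
Similar to ErrImagePull, but after repeated failed attempts Kubernetes backs off and waits for an increasing interval before each retry.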
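To make the back-off behaviour concrete, here is a minimal sketch of the retry schedule. The 300-second cap is documented by Kubernetes; the 10-second starting delay, the doubling factor, and the function name `pull_backoff_delays` are assumptions made for illustration, not values taken from this article.

```python
from typing import List


def pull_backoff_delays(attempts: int, initial: float = 10.0, cap: float = 300.0) -> List[float]:
    """Return the wait (in seconds) before each retry under exponential back-off.

    Assumption: the delay starts at `initial`, doubles after every failed
    attempt, and never exceeds `cap`.
    """
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(min(delay, cap))
        delay *= 2
    return delays


if __name__ == "__main__":
    # Seven consecutive failures: 10, 20, 40, 80, 160, then capped at 300.
    print(pull_backoff_delays(7))
```

Under these assumptions the waits quickly reach the cap, which is why a Pod that has been failing for a while only retries the pull every few minutes.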
1.3 Reproducing the failure
root@k8s-master01:~# kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-deployment-d556bf558-9swpd 0/1 ImagePullBackOff 0 46m
nginx-deployment-d556bf558-d2482 0/1 ErrImagePull 0 46m
nginx-deployment-d556bf558-r4v4z 0/1 ErrImagePull 0 46m
root@k8s-master01:~# kubectl describe pods nginx-deployment-d556bf558-r4v4z |tail -10
Normal Scheduled 47m default-scheduler Successfully assigned default/nginx-deployment-d556bf558-r4v4z to k8s-node03
Warning Failed 46m kubelet Failed to pull image "nginx:1.14.2": failed to pull and unpack image "docker.io/library/nginx:1.14.2": failed to resolve reference "docker.io/library/nginx:1.14.2": failed to do request: Head "https://registry-1.docker.io/v2/library/nginx/manifests/1.14.2": dial tcp 162.125.32.13:443: connect: connection refused
Warning Failed 46m kubelet Failed to pull image "nginx:1.14.2": failed to pull and unpack image "docker.io/library/nginx:1.14.2": failed to resolve reference "docker.io/library/nginx:1.14.2": failed to do request: Head "https://registry-1.docker.io/v2/library/nginx/manifests/1.14.2": dial tcp 69.171.229.11:443: connect: connection refused
Warning Failed 45m kubelet Failed to pull image "nginx:1.14.2": failed to pull and unpack image "docker.io/library/nginx:1.14.2": failed to resolve reference "docker.io/library/nginx:1.14.2": failed to do request: Head "https://registry-1.docker.io/v2/library/nginx/manifests/1.14.2": dial tcp 157.240.11.40:443: connect: connection refused
Normal Pulling 44m (x4 over 47m) kubelet Pulling image "nginx:1.14.2"
Warning Failed 44m (x4 over 46m) kubelet Error: ErrImagePull
Warning Failed 44m kubelet Failed to pull image "nginx:1.14.2": failed to pull and unpack image "docker.io/library/nginx:1.14.2": failed to resolve reference "docker.io/library/nginx:1.14.2": failed to do request: Head "https://registry-1.docker.io/v2/library/nginx/manifests/1.14.2": dial tcp 108.160.165.48:443: connect: connection refused
Warning Failed 44m (x6 over 46m) kubelet Error: ImagePullBackOff
Warning Failed 12m (x4 over 28m) kubelet (combined from similar events): Failed to pull image "nginx:1.14.2": failed to pull and unpack image "docker.io/library/nginx:1.14.2": failed to resolve reference "docker.io/library/nginx:1.14.2": failed to do request: Head "https://registry-1.docker.io/v2/library/nginx/manifests/1.14.2": dial tcp 108.160.172.1:443: connect: connection refused
Normal BackOff 2m8s (x178 over 46m) kubelet Back-off pulling image "nginx:1.14.2"
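The events above all end in "connection refused", which points at a network problem rather than a bad image name or bad credentials. As a rough sketch, the helper below sorts a pull error message into one of the causes listed in 1.1. The function name `classify_pull_error` and every substring other than "connection refused" are assumptions about typical runtime messages, so treat the rules as a starting point rather than a complete list.

```python
def classify_pull_error(message: str) -> str:
    """Guess the cause of an image pull failure from its event message.

    The substrings are illustrative assumptions; real messages vary by
    container runtime and registry.
    """
    text = message.lower()
    rules = [
        ("network", ("connection refused", "i/o timeout", "no such host")),
        ("auth", ("unauthorized", "authentication required", "denied")),
        ("missing-image", ("not found", "manifest unknown")),
    ]
    for cause, needles in rules:
        if any(needle in text for needle in needles):
            return cause
    return "unknown"


if __name__ == "__main__":
    event = ('failed to do request: Head "https://registry-1.docker.io/v2/library/nginx/'
             'manifests/1.14.2": dial tcp 162.125.32.13:443: connect: connection refused')
    print(classify_pull_error(event))
```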
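1.4 Solution
- Pull the image
- Retag it and push it to the local harbor (or save it onto every node)
- Change the image in the deployment to the intranet harbor image
# Pull the nginx image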
root@k8s-master01:~/yaml# nerdctl pull nginx:1.14.2
docker.io/library/nginx:1.14.2: resolved |++++++++++++++++++++++++++++++++++++++|
index-sha256:f7988fb6c02e0ce69257d9bd9cf37ae20a60f1df7563c3a2a6abe24160306b8d: done |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:706446e9c6667c0880d5da3f39c09a6c7d2114f5a5d6b74a2fafd24ae30d2078: done |++++++++++++++++++++++++++++++++++++++|
config-sha256:295c7be079025306c4f1d65997fcf7adb411c88f139ad1d34b537164aa060369: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:8ca774778e858d3f97d9ec1bec1de879ac5e10096856dc22ed325a3ad944f78a: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:27833a3ba0a545deda33bb01eaf95a14d05d43bf30bce9267d92d17f069fe897: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:0f23e58bd0b7c74311703e20c21c690a6847e62240ed456f8821f4c067d3659b: done |++++++++++++++++++++++++++++++++++++++|
elapsed: 826.6s total: 42.6 M (52.8 KiB/s)
root@k8s-master01:~/yaml# nerdctl tag nginx:1.14.2 harbor.panasonic.cn/nginx/nginx:1.14.2
# Push the image to the harbor registry
root@k8s-master01:~/yaml# nerdctl push harbor.panasonic.cn/nginx/nginx:1.14.2
INFO[0000] pushing as a reduced-platform image (application/vnd.docker.distribution.manifest.list.v2+json, sha256:3d206f335adbabfc33b20c0190ef88cb47d627d21546d48e72e051e5fc27451a)
index-sha256:3d206f335adbabfc33b20c0190ef88cb47d627d21546d48e72e051e5fc27451a: done |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:706446e9c6667c0880d5da3f39c09a6c7d2114f5a5d6b74a2fafd24ae30d2078: done |++++++++++++++++++++++++++++++++++++++|
config-sha256:295c7be079025306c4f1d65997fcf7adb411c88f139ad1d34b537164aa060369: done |++++++++++++++++++++++++++++++++++++++|
elapsed: 0.6 s total: 7.1 Ki (11.8 KiB/s)
root@k8s-master01:~/yaml# cat deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        #image: nginx:1.14.2  # original image, commented out
        # use the image from harbor instead
        image: harbor.intra.com/nginx/nginx:1.14.2
        ports:
        - containerPort: 80
deployment.apps "nginx-deployment" deleted
root@k8s-master01:~/yaml# kubectl apply -f deployment.yaml
deployment.apps/nginx-deployment created
1.5 Confirming recovery
root@k8s-master01:~/yaml# kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-deployment-8677887b4f-2h2rd 1/1 Running 0 36s
nginx-deployment-8677887b4f-j7kwj 1/1 Running 0 36s
nginx-deployment-8677887b4f-vfmfq 1/1 Running 0 36s
root@k8s-master01:~/yaml# kubectl describe pods nginx-deployment-8677887b4f-vfmfq |tail
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 49s default-scheduler Successfully assigned default/nginx-deployment-8677887b4f-vfmfq to k8s-node01
Normal Pulling 49s kubelet Pulling image "harbor.intra.com/nginx/nginx:1.14.2"
Normal Pulled 46s kubelet Successfully pulled image "harbor.intra.com/nginx/nginx:1.14.2" in 3.069s (3.069s including waiting). Image size: 44708492 bytes.
Normal Created 46s kubelet Created container nginx
Normal Started 46s kubelet Started container nginx
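2. Pending
Pending is one of the most common states for a Pod that fails to start. The main causes are:
- Image pull failure
- Insufficient resources
- Scheduling constraints
- Missing dependencies
2.1 Image pull failure
Section 1 covers this in detail. It usually comes with ErrImagePull or ImagePullBackOff, so it is not repeated here.
2.2 Insufficient resources (CPU, memory)
This happens when the Pod's resource requests, possibly combined with affinity rules or a pinned node, exceed the CPU or memory that any eligible node can offer. The scheduler cannot place the Pod, so it stays Pending.
2.2.1 Reproducing the failure
In the kubectl top nodes output below, each worker node uses about 0.7 to 1.1 GiB of memory at 18% to 28% utilization, which implies a little under 4 GiB per worker. When the Pod requests 6Gi of memory, the scheduler cannot find any node with 6Gi of unreserved memory, so the Pod stays Pending.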
root@k8s-master01:~/yaml# kubectl top nodes
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
k8s-master01 78m 0% 1203Mi 15%
k8s-node01 26m 0% 1091Mi 28%
k8s-node02 25m 0% 739Mi 19%
k8s-node03 24m 0% 701Mi 18%
root@k8s-master01:~/yaml# cat deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        #image: nginx:1.14.2
        image: harbor.intra.com/nginx/nginx:1.14.2
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "6Gi"
            cpu: "1"
          limits:
            memory: "6Gi"
            cpu: "1"
root@k8s-master01:~/yaml# kubectl apply -f deployment.yaml
deployment.apps/nginx-deployment created
root@k8s-master01:~/yaml# kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-deployment-554d6d7fd9-62dn7 0/1 Pending 0 6s
nginx-deployment-554d6d7fd9-bcwvt 0/1 Pending 0 6s
nginx-deployment-554d6d7fd9-n9dnp 0/1 Pending 0 6s
root@k8s-master01:~/yaml# kubectl describe pod nginx-deployment-554d6d7fd9-n9dnp | tail -4
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 5m1s default-scheduler 0/4 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 Insufficient memory. preemption: 0/4 nodes are available: 1 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod.
The events contain an Insufficient memory warning, which confirms that no node has enough memory available for the request.
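The scheduler's decision here boils down to comparing the Pod's memory request with what a node still has unreserved. The sketch below reproduces that comparison for quantities with binary suffixes. The helper names `parse_memory` and `fits`, and the 3900Mi allocatable figure (estimated from the kubectl top percentages), are assumptions for illustration; real Kubernetes quantities also accept decimal suffixes that this sketch does not handle.

```python
# Binary suffixes only; decimal suffixes such as "M" or "G" are out of scope here.
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '6Gi' or '200Mi' to bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: already a byte count


def fits(request: str, allocatable: str, already_requested: str = "0") -> bool:
    """Return True if `request` fits into the node's unreserved memory."""
    free = parse_memory(allocatable) - parse_memory(already_requested)
    return parse_memory(request) <= free


if __name__ == "__main__":
    # Assumed worker size of roughly 3900Mi allocatable memory.
    print(fits("6Gi", "3900Mi"))    # the failing request
    print(fits("200Mi", "3900Mi"))  # the corrected request
```

Note that the comparison uses requests already reserved on the node, not the live usage that kubectl top reports.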
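2.2.2 Fixing the failure
After testing the application, lower requests.memory to an appropriate value so that the nodes have enough resources to schedule the Pod, then redeploy the deployment so the change takes effect.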
root@k8s-master01:~/yaml# cat deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        #image: nginx:1.14.2
        image: harbor.intra.com/nginx/nginx:1.14.2
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "200Mi"
            cpu: "1"
          limits:
            memory: "400Mi"
            cpu: "1"
root@k8s-master01:~/yaml# kubectl apply -f deployment.yaml
deployment.apps/nginx-deployment configured
root@k8s-master01:~/yaml# kubectl get po
NAME READY STATUS RESTARTS AGE
nginx-deployment-5b696b7fc8-2gkgc 1/1 Running 0 66s
nginx-deployment-5b696b7fc8-8kt6p 1/1 Running 0 64s
nginx-deployment-5b696b7fc8-dm8jt 1/1 Running 0 67s
All Pods are now Running.
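2.3 Insufficient resources (storage)
This case is also very common: a ConfigMap, Secret, or PVC is declared in the Pod spec but was not created before the Pod. The Pod cannot use the missing object and never reaches Running; with a missing PVC it cannot even be scheduled and stays Pending.
The usual error is persistentvolumeclaim "<name>" not found.
2.3.1 Reproducing the failure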
root@k8s-master01:~/yaml# cat nginx-nfs.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx-nfs-example
  namespace: default
spec:
  containers:
  - image: harbor.panasonic.cn/nginx/nginx:1.14.2
    name: nginx
    ports:
    - containerPort: 80
      protocol: TCP
    volumeMounts:
    - mountPath: /var/www
      name: pvc-nginx
      readOnly: false
  volumes:
  - name: pvc-nginx
    persistentVolumeClaim:
      claimName: nfs-pvc-default
root@k8s-master01:~/yaml# kubectl apply -f nginx-nfs.yaml
pod/nginx-nfs-example created
root@k8s-master01:~/yaml# kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-nfs-example 0/1 Pending 0 5s
root@k8s-master01:~/yaml# kubectl describe pod nginx-nfs-example |tail -5
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 16s default-scheduler 0/4 nodes are available: persistentvolumeclaim "nfs-pvc-default" not found. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
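This kind of mistake can be caught before applying the manifest by comparing the claims a Pod references with the claims that exist in its namespace. The sketch below does that on a Pod spec held as a plain dictionary. The function name `missing_claims` is hypothetical, and the set of existing claims is assumed to have been collected separately, for example from the NAME column of kubectl get pvc.

```python
from typing import Dict, List, Set


def missing_claims(pod: Dict, existing_claims: Set[str]) -> List[str]:
    """Return the PVC names a Pod references that are not in `existing_claims`.

    Assumes `existing_claims` lists the PVCs in the Pod's own namespace,
    since a Pod can only use claims from its namespace.
    """
    volumes = pod.get("spec", {}).get("volumes", [])
    wanted = [
        volume["persistentVolumeClaim"]["claimName"]
        for volume in volumes
        if "persistentVolumeClaim" in volume
    ]
    return [name for name in wanted if name not in existing_claims]


if __name__ == "__main__":
    pod = {
        "metadata": {"name": "nginx-nfs-example", "namespace": "default"},
        "spec": {
            "volumes": [
                {"name": "pvc-nginx",
                 "persistentVolumeClaim": {"claimName": "nfs-pvc-default"}},
            ]
        },
    }
    # Only pvc-nfs-dynamic exists, so nfs-pvc-default is reported as missing.
    print(missing_claims(pod, {"pvc-nfs-dynamic"}))
```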
2.3.2 Fixing the failure
Add the PV and PVC resources so the Pod has something to mount.
root@k8s-master01:~/yaml# cat nginx-nfs.yaml
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-pv
spec:
  capacity:
    storage: 200Mi
  accessModes:
  - ReadWriteMany
  nfs:
    path: /nfs
    server: 192.168.31.104
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 200Mi
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx-nfs
  namespace: default
spec:
  containers:
  - image: harbor.panasonic.cn/nginx/nginx:1.14.2
    name: nginx
    ports:
    - containerPort: 80
      protocol: TCP
    volumeMounts:
    - mountPath: /var/www
      name: nfs-pvc
      readOnly: false
  volumes:
  - name: nfs-pvc
    persistentVolumeClaim:
      claimName: nfs-pvc
After the configuration is applied, the failure is gone.
root@k8s-master01:~/yaml# kubectl apply -f nginx-nfs.yaml
persistentvolume/nfs-pv created
persistentvolumeclaim/nfs-pvc created
pod/nginx-nfs created
root@k8s-master01:~/yaml# kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS VOLUMEATTRIBUTESCLASS REASON AGE
nfs-pv 200Mi RWX Retain Bound default/nfs-pvc <unset> 3s
pvc-0748bb20-1e4a-4741-845c-0bae59160ef6 10Gi RWX Delete Bound default/pvc-nfs-dynamic nfs-csi <unset> 32d
pvc-7a0bba72-8d63-4393-861d-c4a409d48933 2Gi RWO Delete Terminating test/nfs-pvc nfs-storage <unset> 32d
root@k8s-master01:~/yaml# kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
nfs-pvc Bound nfs-pv 200Mi RWX <unset> 6s
pvc-nfs-dynamic Bound pvc-0748bb20-1e4a-4741-845c-0bae59160ef6 10Gi RWX nfs-csi <unset> 32d
root@k8s-master01:~/yaml# kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-nfs 1/1 Running 0 9s
The same approach applies to missing ConfigMaps and Secrets: create the object the Pod references.
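2.4 Label selectors and affinity
These failures are usually caused by a nodeSelector or a required (hard) affinity rule that matches no node, or that matches only nodes without enough resources.
2.4.1 Reproducing the failure
The worker nodes are labeled worker=true, but the deployment's nodeSelector is mistakenly set to a different value, so the Pods end up Pending. Note that --label-columns expects a label key, so passing worker=true as below produces an empty column; --label-columns worker would show the label's value.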
root@k8s-master01:~/yaml# kubectl get nodes --label-columns worker=true
NAME STATUS ROLES AGE VERSION WORKER=TRUE
k8s-master01 Ready control-plane 94d v1.31.0
k8s-node01 Ready <none> 94d v1.31.0
k8s-node02 Ready <none> 94d v1.31.0
k8s-node03 Ready <none> 94d v1.31.0
root@k8s-master01:~/yaml# cat deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      nodeSelector:
        worker: node8
      containers:
      - name: nginx
        #image: nginx:1.14.2
        image: harbor.intra.com/nginx/nginx:1.14.2
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "6Gi"
            cpu: "1"
          limits:
            memory: "6Gi"
            cpu: "1"
root@k8s-master01:~/yaml# kubectl apply -f deployment.yaml
deployment.apps/nginx-deployment created
root@k8s-master01:~/yaml# kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-deployment-86895b4d79-dm6z4 0/1 Pending 0 84s
nginx-deployment-86895b4d79-tptlw 0/1 Pending 0 84s
nginx-deployment-86895b4d79-v6bfh 0/1 Pending 0 84s
root@k8s-master01:~/yaml# kubectl describe pods nginx-deployment-86895b4d79-v6bfh | tail -5
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 104s default-scheduler 0/4 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
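A nodeSelector only matches a node when every key in the selector is present on the node with exactly the same value. The sketch below applies that rule to a small in-memory map of node labels. The function name `matching_nodes` is hypothetical, and the label sets are assumptions reconstructed from the description in this section rather than real cluster output.

```python
from typing import Dict, List


def matching_nodes(node_selector: Dict[str, str], nodes: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the names of nodes whose labels satisfy every selector entry."""
    return [
        name
        for name, labels in nodes.items()
        if all(labels.get(key) == value for key, value in node_selector.items())
    ]


if __name__ == "__main__":
    # Assumed labels: the three workers carry worker=true, the master does not.
    nodes = {
        "k8s-master01": {"node-role.kubernetes.io/control-plane": ""},
        "k8s-node01": {"worker": "true"},
        "k8s-node02": {"worker": "true"},
        "k8s-node03": {"worker": "true"},
    }
    print(matching_nodes({"worker": "node8"}, nodes))  # the misconfigured selector
    print(matching_nodes({"worker": "true"}, nodes))   # the intended selector
```

Label values are strings, which is why a value such as true has to be quoted in the YAML manifest.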
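2.4.2 Fixing the failure
There are generally two ways to fix this.
- Change the nodeSelector in the deployment to the correct value.
- In a production environment where you may not want to restart the application, add the expected label to the appropriate nodes instead.
Here we correct the nodeSelector in the YAML and redeploy. The manifest below also lowers the memory request from 6Gi to 200Mi.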
root@k8s-master01:~/yaml# cat deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      nodeSelector:
        worker: 'true'
      containers:
      - name: nginx
        #image: nginx:1.14.2
        image: harbor.intra.com/nginx/nginx:1.14.2
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "200Mi"
            cpu: "0.1"
          limits:
            memory: "500Mi"
            cpu: "1"
root@k8s-master01:~/yaml# kubectl apply -f deployment.yaml
deployment.apps/nginx-deployment configured
root@k8s-master01:~/yaml# kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-deployment-55cdb49d65-2jkxl 1/1 Running 0 2s
nginx-deployment-55cdb49d65-bdltk 1/1 Running 0 3s
nginx-deployment-55cdb49d65-cb44w 1/1 Running 0 5s
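These are the most common cases, and they mostly come down to an unmet dependency. Running kubectl describe pods <POD_NAME> usually reveals the problem, and the reported error tells you what to fix.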
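The FailedScheduling messages shown in this section all follow the same shape: a node count, then a comma-separated list of reasons, then a preemption summary. As a rough aid for reading them, the sketch below splits such a message into its reasons and the number of nodes each one rejected. The function name `scheduling_reasons` is hypothetical, and the parsing relies only on the message layout seen in the events above, so other scheduler versions may need adjustments.

```python
import re
from typing import Dict


def scheduling_reasons(message: str) -> Dict[str, int]:
    """Map each reason in a FailedScheduling message to its node count.

    Reasons reported without a leading count (such as a missing PVC)
    are recorded with a count of 0.
    """
    summary = message.split(" preemption:")[0]
    _, _, details = summary.partition("nodes are available: ")
    reasons: Dict[str, int] = {}
    for part in details.rstrip(".").split(", "):
        match = re.match(r"(\d+) (.+)", part)
        if match:
            reasons[match.group(2)] = int(match.group(1))
        elif part:
            reasons[part] = 0
    return reasons


if __name__ == "__main__":
    message = ("0/4 nodes are available: 1 node(s) had untolerated taint "
               "{node-role.kubernetes.io/control-plane: }, 3 Insufficient memory. "
               "preemption: 0/4 nodes are available: 1 Preemption is not helpful "
               "for scheduling, 3 No preemption victims found for incoming pod.")
    for reason, count in scheduling_reasons(message).items():
        print(count, reason)
```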
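3. Supplement: common Pod states and their causes
Commonly seen states and events:
3.1 ContainerCreating
- Kubernetes is creating the Pod's containers but has not finished yet.
- Possible causes include waiting for volumes to be mounted or for networking to be configured.
3.2 ErrImagePull (image pull error)
- Kubernetes cannot pull the container image from the image registry.
- Possible causes include a wrong image name, a nonexistent image, failed authentication, and network problems.
3.3 ImagePullBackOff (image pull back-off)
Similar to ErrImagePull, but after repeated failed attempts Kubernetes backs off and waits for an increasing interval before each retry.
3.4 CrashLoopBackOff (crash loop back-off)
The container crashes shortly after starting. Kubernetes keeps restarting it, and after repeated failures it backs off before each new restart.
Possible causes include application errors, misconfiguration, and insufficient resources.
3.5 Running - Ready
All containers in the Pod are running and have passed their health checks, so the Pod can receive traffic.
3.6 Terminating
Kubernetes is terminating the Pod, for example because the Pod was deleted or the node is under maintenance.
3.7 Pending - ImagePullBackOff
The Pod is in the Pending phase and is backing off because the image pull failed.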