Kubernetes Pods Stuck in Pending for a Long Time: Analysis and Resolution
Background
While the IDE is running, we launch a Deployment to back each workspace, and the ide-server runs inside the resulting Pod. During performance stress testing, however, a problem showed up.
Problem
With 50,000 Deployments in a single namespace, launching 500 Pods took 5 minutes 30 seconds from create to Running. That does not meet our requirements, so it had to be optimized.
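For context, the 5m30s figure comes from watching the Pods and recording when each one reaches Running. The snippet below is a minimal sketch of such a measurement, not our actual load-test harness: it assumes in-cluster credentials and reuses the feiyan-1000000000 namespace and the plugin=code-server label that appear in the Pod description further down.

// Minimal sketch: measure how long each workspace Pod takes to reach
// Running. Assumes in-cluster config; namespace and label selector are
// taken from the pending Pods shown below.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Start watching before creating the Deployments so no transition is missed.
	w, err := client.CoreV1().Pods("feiyan-1000000000").Watch(context.TODO(),
		metav1.ListOptions{LabelSelector: "plugin=code-server"})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	start := time.Now()
	// ... create the 500 workspace Deployments here ...

	running := map[string]bool{}
	for ev := range w.ResultChan() {
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		if pod.Status.Phase == corev1.PodRunning && !running[pod.Name] {
			running[pod.Name] = true
			fmt.Printf("%-50s running after %v (%d/500)\n",
				pod.Name, time.Since(start).Round(time.Second), len(running))
			if len(running) == 500 {
				break
			}
		}
	}
}

A real harness would also watch for failures and time out; this version only records the Running transitions.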
Symptoms
NAMESPACE            NAME                                       READY   STATUS    RESTARTS   AGE     IP       NODE     NOMINATED NODE   READINESS GATES
feiyan-1000000000    ws-cr3ipb4uccam7di1b9u0-6dbf68f48c-z6vc6   0/1     Pending   0          2m50s   <none>   <none>   <none>           <none>
feiyan-1000000000    ws-cr3ipb4uccam7di1b9u0-6dbf68f48c-zcbqn   0/1     Pending   0          2m52s   <none>   <none>   <none>           <none>
feiyan-1000000000    ws-cr3ipb4uccam7di1b9u0-6dbf68f48c-zfz8g   0/1     Pending   0          2m49s   <none>   <none>   <none>           <none>
feiyan-1000000000    ws-cr3ipb4uccam7di1b9u0-6dbf68f48c-zml6s   0/1     Pending   0          2m50s   <none>   <none>   <none>           <none>
feiyan-1000000000    ws-cr3ipb4uccam7di1b9u0-6dbf68f48c-zmmql   0/1     Pending   0          2m52s   <none>   <none>   <none>           <none>
Describing one of the pending Pods:

Name:             ws-cr3ipb4uccam7di1b9u0-6dbf68f48c-zmmql
Namespace:        feiyan-1000000000
Priority:         0
Service Account:  default
Node:             <none>
Labels:           app.kubernetes.io/instance=ws-cr3ipb4uccam7di1b9u0
                  app.kubernetes.io/name=ws-cr3ipb4uccam7di1b9u0
                  plugin=code-server
                  pod-template-hash=6dbf68f48c
                  workspace=ws-cr3ipb4uccam7di1b9u0
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Controlled By:    ReplicaSet/ws-cr3ipb4uccam7di1b9u0-6dbf68f48c
Containers:
  code-server:
    Image:      eps-beijing.cr.xxx.com/infcprelease/ide-server:v1.4.0-2408131650
    Port:       8910/TCP
    Host Port:  0/TCP
    Args:
      server
      --config=/etc/feiyan/config.yaml
      --webide-workspace-id=cr3ipb4uccam7di1b9u0
    Limits:
      cpu:     2
      memory:  4Gi
    Requests:
      cpu:       500m
      memory:    512Mi
    Liveness:    http-get http://:http/healthz%3Ftype=liveness delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:   http-get http://:http/healthz%3Ftype=readiness delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      USER_ID:   189331
      GROUP_ID:  334242
      WS_HOME:   /home/runner/code
      WS_ID:     cr3ipb4uccam7di1b9u0
      HOME:      /home/runner
    Mounts:
      /etc/feiyan from config (ro)
      /home/runner/code from ws-path (rw)
      /nix from nix (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-bj4kb (ro)
Volumes:
  ws-path:
    Type:          HostPath (bare host directory volume)
    Path:          /shared/ws/1000000000/luo
    HostPathType:  Directory
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      feiyan-config
    Optional:  false
  nix:
    Type:          HostPath (bare host directory volume)
    Path:          /shared/nix/nix
    HostPathType:  Directory
  default-token-bj4kb:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-bj4kb
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>
Note that Events is empty and Node is still <none>: the Pod has not even been scheduled yet.
Routine checks
- Node resources are sufficient (no node pressure at all), so insufficient capacity is ruled out.
- Checked the Pod status: Pending, with empty Events (see above).
Checking the kube-scheduler logs shows that scheduling a single Pod takes about 700ms. The scheduler runs one scheduling cycle at a time, so 500 × 700ms = 350s: scheduling alone accounts for almost the entire 5m30s. Far too slow.
I0827 18:35:36.050146 1 trace.go:205] Trace[401520597]: "Scheduling" namespace:feiyan-1000000000,name:ws-cr48mgc00l3h770u0880-78b6b5bf5c-brl62 (27-Aug-2024 18:35:35.424) (total time: 625ms):
Trace[401520597]: ---"Prioritizing done" 625ms (18:35:00.050)
Trace[401520597]: [625.270367ms] [625.270367ms] END
I0827 18:35:36.813014 1 trace.go:205] Trace[1574660176]: "Scheduling" namespace:feiyan-1000000000,name:ws-cr48mgb1lsr0vb3dkji0-8645fd9665-wmvw9 (27-Aug-2024 18:35:36.050) (total time: 762ms):
Trace[1574660176]: ---"Prioritizing done" 762ms (18:35:00.812)
Trace[1574660176]: [762.728076ms] [762.728076ms] END
I0827 18:35:37.460196 1 trace.go:205] Trace[1439543403]: "Scheduling" namespace:feiyan-1000000000,name:ws-cr48mgb1lsr0vb3dkjm0-84c544f4b7-b4kcp (27-Aug-2024 18:35:36.813) (total time: 647ms):
Trace[1439543403]: ---"Prioritizing done" 646ms (18:35:00.460)
Trace[1439543403]: [647.020389ms] [647.020389ms] END
I0827 18:35:38.140289 1 trace.go:205] Trace[980518911]: "Scheduling" namespace:feiyan-1000000000,name:ws-cr48mg3lpur1r9npv5vg-76b6b8c9d5-5n8wb (27-Aug-2024 18:35:37.460) (total time: 679ms):
Trace[980518911]: ---"Prioritizing done" 679ms (18:35:00.140)
Trace[980518911]: [679.90811ms] [679.90811ms] END
Digging into kube-scheduler
Since each cycle is slow, let's look at what it actually does. In the traces above, essentially all of the elapsed time is attributed to the "Prioritizing done" step, which points at the scoring phase. The entry point is genericScheduler.Schedule:
// Schedule tries to schedule the given pod to one of the nodes in the node list.
// If it succeeds, it will return the name of the node.
// If it fails, it will return a FitError error with reasons.
func (g *genericScheduler) Schedule(ctx context.Context, fwk framework.Framework, state *framework.CycleState, pod *v1.Pod) (result ScheduleResult, err error) {
	trace := utiltrace.New("Scheduling", utiltrace.Field{Key: "namespace", Value: pod.Namespace}, utiltrace.Field{Key: "name", Value: pod.Name})
	// Any scheduling cycle slower than 100ms is logged; that is exactly
	// the trace output we saw above.
	defer trace.LogIfLong(100 * time.Millisecond)
	...
	trace.Step("Snapshotting scheduler cache and node infos done")
	...
	// Filtering: find the nodes that fit the pod.
	trace.Step("Computing predicates done")
	...
	// Scoring (prioritizing): rank the feasible nodes.
	trace.Step("Prioritizing done")
	return ScheduleResult{
		SuggestedHost:  host,
		EvaluatedNodes: len(feasibleNodes) + len(filteredNodesStatuses),
		FeasibleNodes:  len(feasibleNodes),
	}, err
}
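To see why the prioritizing step can dominate, here is a deliberately simplified, hypothetical sketch of the scoring phase's cost shape (loosely modeled on the scheduler's prioritizeNodes; the real framework runs Score plugins in parallel across nodes and with many more details):

// Hypothetical simplification of the scoring phase. The point is the
// shape of the cost: O(feasible nodes × score plugins × per-plugin work).
package sketch

import v1 "k8s.io/api/core/v1"

// ScorePlugin stands in for the framework's Score extension point.
type ScorePlugin interface {
	Name() string
	Score(pod *v1.Pod, node *v1.Node) int64
}

// prioritize scores every feasible node with every plugin and sums the
// results. If a single plugin's Score consults a lot of cluster state
// (say, every object in a huge namespace), that cost is paid once per
// node for every pod being scheduled.
func prioritize(pod *v1.Pod, nodes []*v1.Node, plugins []ScorePlugin) map[string]int64 {
	totals := make(map[string]int64, len(nodes))
	for _, p := range plugins {
		for _, n := range nodes {
			totals[n.Name] += p.Score(pod, n)
		}
	}
	return totals
}

With roughly 700ms per Pod spent almost entirely in this phase, and scheduling cycles running one Pod at a time, that per-Pod scoring cost multiplies directly into the ~350s total computed above.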