Kubernetes v1.36 Deep Dive: Security Overhaul and Performance Revolution for Container Orchestration in the AI Era

2026-06-03 10:27:05 +0800 CST

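In April 2026, Kubernetes v1.36 (codename Haru) was officially released. This is no ordinary release: it marks the cloud native platform's full pivot toward AI workloads. Seventy enhancements, several breaking changes, the formal retirement of Ingress NGINX, and the maturing of Dynamic Resource Allocation (DRA) add up to an iteration built around two themes: tighter security defaults and AI-native support.

As a long-time Kubernetes user and contributor, I want to examine the technical core of this release from an architect's point of view.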

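1. Release Overview: Why Does v1.36 Matter?

1.1 Release Timeline and Cadence

Kubernetes v1.36 shipped at the end of April 2026, keeping to the project's cadence of one minor release roughly every four months. This release is different from its predecessors, though. It is a watershed release that marks the shift from a general-purpose container orchestration platform to AI-native infrastructure.

According to the CNCF 2026 annual report, more than 65% of Kubernetes clusters worldwide now run AI/ML workloads, up from only 23% in 2023. That shift directly shaped the design decisions in v1.36.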

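1.2 Core Change Matrix

Change type	Change	Impact	Migration cost
Removed	gitRepo volume driver permanently disabled	High	Workload configuration must be refactored
Retired	Ingress NGINX project no longer maintained	High	Migrate to Gateway API
Deprecated	Service.spec.externalIPs field	Medium	Audit current usage
GA	SELinux volume relabeling performance optimization	Medium	Takes effect automatically
Beta	DRA device taints and tolerations	Low	Must be explicitly enabled
Beta	DRA partitionable device support	Medium	Requires DeviceClass configuration
GA	External signing of ServiceAccount tokens	Low	Migrated automatically

These changes are not isolated. Together they serve one goal: making Kubernetes a production-grade platform for AI workloads while preserving backward compatibility wherever possible.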

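2. Security Overhaul: From Permissive Defaults to Security First

2.1 Permanent Removal of the gitRepo Volume Driver

This is the most disruptive change in v1.36. The gitRepo volume type has been marked deprecated since v1.11, but it was not removed for good until v1.36.

Why did it have to go?

The core security problem with gitRepo volumes is that the Git clone runs on the node during Pod setup, with root privileges. An attacker can reach node-level code execution along this path:

1. Craft a malicious Git repository containing a .git/hooks/post-checkout hook
2. Trick a user into deploying a Pod that uses the repository
3. The hook fires during the Git clone and runs arbitrary code with kubelet privileges
4. The attacker gains control of the node and can move laterally across the cluster

CVE-2024-10220 documents this attack chain in detail. In v1.35 and earlier, gitRepo remained usable despite the warnings, which left the door open to attackers.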
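
Before upgrading, it helps to know which workloads still depend on gitRepo. The following is a minimal Python sketch, assuming the Pod list has already been exported as JSON (for example with `kubectl get pods --all-namespaces -o json`). The function name `pods_using_gitrepo` and the sample data are illustrative and not part of any Kubernetes tooling.

```python
from typing import Any


def pods_using_gitrepo(pod_list: dict[str, Any]) -> list[str]:
    """Return "namespace/name" for every Pod that declares a gitRepo volume."""
    hits = []
    for pod in pod_list.get("items", []):
        volumes = pod.get("spec", {}).get("volumes") or []
        # A volume entry has one key per volume source, e.g. {"name": ..., "gitRepo": {...}}
        if any("gitRepo" in volume for volume in volumes):
            meta = pod["metadata"]
            hits.append(f'{meta.get("namespace", "default")}/{meta["name"]}')
    return hits


# Hypothetical sample input shaped like `kubectl get pods -o json` output
sample_pods = {
    "items": [
        {
            "metadata": {"namespace": "dev", "name": "legacy"},
            "spec": {"volumes": [{"name": "src", "gitRepo": {"repository": "https://example.com/r.git"}}]},
        },
        {
            "metadata": {"namespace": "prod", "name": "modern"},
            "spec": {"volumes": [{"name": "cfg", "configMap": {"name": "app-config"}}]},
        },
    ]
}

gitrepo_pods = pods_using_gitrepo(sample_pods)
print(gitrepo_pods)
```

Pods created by a Deployment or Job inherit the volume from the controller's template, so the template is where the fix has to be applied.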

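Migration options compared

Option	Best for	Security	Maintenance cost
initContainer + git-sync	Configuration that must sync dynamically	High	Low
ConfigMap/Secret	Static configuration files	Highest	Lowest
CSI driver + external storage	Large configuration repositories	Medium	Medium
Baked into the OCI image	Containerized deployments	High	Lowest

Recommended option: initContainer + git-sync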

apiVersion: v1
kind: Pod
metadata:
  name: app-with-git-sync
spec:
  volumes:
  - name: git-data
    emptyDir: {}
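  initContainers:
  - name: git-sync
    image: registry.k8s.io/git-sync/git-sync:v4.4.0
    # restartPolicy: Always makes this a native sidecar, so the periodic sync
    # keeps running without blocking the app container from starting.
    restartPolicy: Always
    securityContext:
      runAsNonRoot: true
      runAsUser: 65533
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
      seccompProfile:
        type: RuntimeDefault
    args:
    - --repo=https://github.com/example/config-repo
    - --root=/tmp/git        # git-sync v4 needs an explicit working directory
    - --ref=main             # v4 replaces --branch/--rev with --ref
    - --depth=1
    - --period=30s
    - --sync-timeout=60s     # v4 name for the per-sync timeout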
    volumeMounts:
    - name: git-data
      mountPath: /tmp/git
  containers:
  - name: app
    image: myapp:latest
    volumeMounts:
    - name: git-data
      mountPath: /etc/config
      readOnly: true

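Key improvements:

Least privilege: git-sync runs as a non-root user with a read-only root filesystem
Complete security context: seccomp is enabled and privilege escalation is blocked
Timeout protection: a Git operation cannot block indefinitely
Periodic sync: updates are checked every 30 seconds, with no Pod restart required

ConfigMap option (preferred for static configuration)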

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: production
  annotations:
    # Generated from the Git repository with kustomize or Helm
    config.k8s.io/origin: "git@github.com:example/config-repo.git@main"
data:
  application.yml: |
    server:
      port: 8080
    spring:
      application:
        name: production-service
    logging:
      level:
        root: INFO
---
apiVersion: apps/v1
kind: Deployment
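metadata:
  name: my-app
  namespace: production  # must match the ConfigMap's namespace
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec: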
      containers:
      - name: app
        image: myapp:latest
        volumeMounts:
        - name: config
          mountPath: /etc/config
          readOnly: true
      volumes:
      - name: config
        configMap:
          name: app-config
          items:
          - key: application.yml
            path: application.yml

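2.2 Deprecation of Service.spec.externalIPs

The externalIPs field lets a Service claim arbitrary external IPs, and it was once widely used in private cloud environments. It carries a serious security risk, however:

CVE-2020-8554 attack path

1. The attacker obtains permission to create Services in a namespace (common in multi-tenant clusters)
2. The attacker creates a Service whose externalIPs is set to a target IP outside the cluster (an internal database, for example)
3. In-cluster clients that connect to that IP have their traffic redirected to endpoints the attacker controls
4. The attacker mounts a man-in-the-middle attack and steals sensitive data

Starting with v1.36, using externalIPs triggers a deprecation warning, and full removal is planned for v1.43.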
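
Until the field is gone, a common mitigation is to allow only externalIPs that fall inside operator-approved ranges. The sketch below shows that check in Python, assuming Service objects are available as parsed JSON. The function name `disallowed_external_ips` and the CIDR allowlist are hypothetical; in a real cluster this logic would live in an admission policy or webhook.

```python
import ipaddress


def disallowed_external_ips(service: dict, allowed_cidrs: list[str]) -> list[str]:
    """Return the Service's externalIPs that are outside every allowed CIDR."""
    networks = [ipaddress.ip_network(cidr) for cidr in allowed_cidrs]
    external_ips = service.get("spec", {}).get("externalIPs") or []
    return [
        ip
        for ip in external_ips
        if not any(ipaddress.ip_address(ip) in network for network in networks)
    ]


# Hypothetical Service that claims one approved and one unapproved address
sample_service = {
    "metadata": {"namespace": "tenant-a", "name": "suspicious"},
    "spec": {"externalIPs": ["203.0.113.10", "10.0.0.5"]},
}

rejected = disallowed_external_ips(sample_service, ["203.0.113.0/24"])
print(rejected)
```

An empty result means the Service would pass the allowlist; anything else identifies the addresses to reject or review.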

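Alternatives decision tree

Need inbound external traffic?
├─ Cloud environment → LoadBalancer Service (recommended)
│   └─ AWS NLB example
│       annotations:
│         service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
│         service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
│
├─ Private cloud / bare metal → Gateway API (the modern option)
│   └─ Supports HTTP/TCP/UDP
│   └─ Fine-grained routing control
│
├─ Simple port exposure → NodePort Service
│   └─ Port range: 30000-32767
│   └─ Security groups must be configured
│
└─ Internal service → ClusterIP + Ingress/Gateway
    └─ Zero-trust network policies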

LoadBalancer Service configuration example

apiVersion: v1
kind: Service
metadata:
  name: production-service
  namespace: production
  annotations:
    # AWS NLB settings
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    # Health checks
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: "HTTP"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: "/health"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "30"
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local  # preserves the client source IP
  ports:
  - name: https
    port: 443
    targetPort: 8443
    protocol: TCP
  selector:
    app: production

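2.3 External Signing of ServiceAccount Tokens (GA)

This security enhancement is easy to overlook but far-reaching. In v1.35 and earlier, ServiceAccount tokens are by default signed by kube-apiserver with a private key loaded from a file on the control plane nodes, which has these drawbacks:

The key sits on control plane disks, so a single leak puts the whole cluster at risk
No integration with external IAM systems
Tokens cannot be revoked immediately

v1.36 promotes external signing to GA, so ServiceAccount tokens can be signed by an external KMS (such as AWS KMS, GCP KMS, or Vault).

Configuration example (a Pod whose projected token targets Vault)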

apiVersion: v1
kind: Pod
metadata:
  name: external-signed-pod
spec:
  serviceAccountName: app-service-account
  containers:
  - name: app
    image: myapp:latest
    env:
    - name: VAULT_ADDR
      value: "https://vault.example.com"
    volumeMounts:
    # Projected ServiceAccount token, readable at /var/run/secrets/tokens/token
    - name: token
      mountPath: /var/run/secrets/tokens
      readOnly: true
  volumes:
  - name: token
    projected:
      sources:
      - serviceAccountToken:
          path: token
          expirationSeconds: 3600
          audience: "https://vault.example.com"

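External signing architecture:

Pod starts
   ↓
kubelet requests a token from the API server
   ↓
API server calls the external signer
   ↓
Vault/AWS KMS/GCP KMS returns the signed token
   ↓
kubelet projects the token into the Pod
   ↓
Application uses the token to access external services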

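Security benefits

Property	Built-in signing	External signing
Key rotation	Requires a cluster restart	Online rotation
Immediate revocation	Not supported	Supported
Audit trail	Limited	Complete
Multi-cluster consistency	Manual synchronization	Unified automatically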
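
Whichever component signs the token, the workload still receives a standard JWT. As a rough sketch, the snippet below decodes the claims of such a token so that the `aud` value and the lifetime requested in the projected volume can be inspected. It does not verify the signature, and the demo token is fabricated locally purely for illustration; `decode_jwt_claims` is a hypothetical helper name.

```python
import base64
import json


def decode_jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore the stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def _b64url(obj: dict) -> str:
    """Encode a dict the way JWT segments are encoded (demo helper only)."""
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Fabricated demo token; the signature segment is a placeholder
demo_token = ".".join([
    _b64url({"alg": "RS256", "kid": "demo-key"}),
    _b64url({
        "aud": ["https://vault.example.com"],
        "iat": 1700000000,
        "exp": 1700003600,
        "sub": "system:serviceaccount:default:app-service-account",
    }),
    "signature-placeholder",
])

claims = decode_jwt_claims(demo_token)
lifetime_seconds = claims["exp"] - claims["iat"]
print(claims["aud"], lifetime_seconds)
```

Inside a Pod the input would be the contents of the projected token file. Decoding is only for inspection; the relying party (Vault in this example) must still validate the signature against the cluster's published keys.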

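3. Ingress NGINX Retirement: The Gateway API Era Is Here

3.1 Reading the Retirement Announcement

On March 24, 2026, Kubernetes SIG Network and the Security Response Committee jointly announced that the Ingress NGINX project was officially retired.

Why it was retired

Maintenance crisis
├─ Core maintainers dwindled (from 12 to 2)
├─ Backlog of more than 300 open PRs
└─ Average vulnerability response time grew from 7 days to 45 days

Technical debt
├─ Heavy legacy codebase (first commit in 2015)
├─ Many limitations in the Ingress v1 API
└─ No support for advanced routing features

Ecosystem shift
├─ Gateway API is GA (enabled by default in v1.36)
├─ Envoy Gateway becomes the officially recommended option
└─ Mature alternatives such as Contour and Traefik

Timeline

2026-03-24: Announcement published; new feature development stops
2026-04-30: Security fixes stop
2026-06-30: Code repository archived
2026-12-31: Removed from the CNCF landscape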

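3.2 Gateway API vs Ingress: Architecture Comparison

Dimension	Ingress v1	Gateway API
API maturity	GA	GA
Routing capability	Basic HTTP	HTTP/TCP/UDP/gRPC
Role separation	None	Infrastructure / cluster / namespace
Extensibility	Annotation hacks	Native extension through CRDs
Multi-protocol	Requires custom work	Native support
Cross-namespace routing	Not supported	Supported (requires explicit configuration)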
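
To make the comparison concrete, here is a small Python sketch of how the routing portion of an Ingress maps onto an HTTPRoute. It assumes a single-host Ingress that uses only `Prefix` or `Exact` path types and ignores TLS and annotations entirely. The function name `ingress_to_httproute` and the sample objects are illustrative; this is not how any official migration tool is implemented.

```python
def ingress_to_httproute(ingress: dict, gateway_name: str, gateway_namespace: str) -> dict:
    """Translate a networking.k8s.io/v1 Ingress dict into an HTTPRoute dict."""
    # Ingress pathType -> Gateway API path match type (ImplementationSpecific is not handled)
    path_types = {"Prefix": "PathPrefix", "Exact": "Exact"}
    hostnames = []
    rules = []
    for rule in ingress["spec"].get("rules", []):
        host = rule.get("host")
        if host and host not in hostnames:
            hostnames.append(host)
        for path in rule.get("http", {}).get("paths", []):
            service = path["backend"]["service"]
            rules.append({
                "matches": [{"path": {"type": path_types[path["pathType"]],
                                      "value": path["path"]}}],
                "backendRefs": [{"name": service["name"],
                                 "port": service["port"]["number"]}],
            })
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": ingress["metadata"]["name"],
                     "namespace": ingress["metadata"].get("namespace", "default")},
        "spec": {
            "parentRefs": [{"name": gateway_name, "namespace": gateway_namespace}],
            "hostnames": hostnames,
            "rules": rules,
        },
    }


# Hypothetical single-host Ingress with two prefix paths
sample_ingress = {
    "metadata": {"name": "my-app-ingress", "namespace": "production"},
    "spec": {"rules": [{
        "host": "api.example.com",
        "http": {"paths": [
            {"path": "/v1", "pathType": "Prefix",
             "backend": {"service": {"name": "api-v1-service", "port": {"number": 8080}}}},
            {"path": "/v2", "pathType": "Prefix",
             "backend": {"service": {"name": "api-v2-service", "port": {"number": 8080}}}},
        ]},
    }]},
}

route = ingress_to_httproute(sample_ingress, "production-gateway", "gateway-infra")
print(route["spec"]["hostnames"], len(route["spec"]["rules"]))
```

The sketch also shows the role split in miniature: the route only names its parent Gateway, while listeners and certificates stay with whoever owns that Gateway.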

Role separation architecture

┌──────────────────────────────────────────────┐
│           Infrastructure Team                │
│   GatewayClass: gateway type and config      │
│   Gateway: listeners and TLS certificates    │
└──────────────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────┐
│           Cluster Operators                  │
│   Cross-namespace routing policy             │
│   Global security and observability config   │
└──────────────────────────────────────────────┘
                     ↓
┌──────────────────────────────────────────────┐
│           Application Developers             │
│   HTTPRoute: application-level routing rules │
│   Attached to a Gateway                      │
└──────────────────────────────────────────────┘

3.3 Migration in Practice: From Ingress to Gateway API

Original Ingress configuration

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress
  namespace: production
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/limit-rps: "100"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.example.com
    secretName: tls-secret
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /v1
        pathType: Prefix
        backend:
          service:
            name: api-v1-service
            port:
              number: 8080
      - path: /v2
        pathType: Prefix
        backend:
          service:
            name: api-v2-service
            port:
              number: 8080

Gateway API configuration after migration

# Step 1: define the GatewayClass
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: envoy-gateway-class
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
  parametersRef:
    group: gateway.envoyproxy.io
    kind: EnvoyProxy
    name: envoy-proxy-config
    namespace: envoy-gateway-system  # EnvoyProxy is namespaced; adjust to your install
---
# Step 2: define the Gateway
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: production-gateway
  namespace: gateway-infra
spec:
  gatewayClassName: envoy-gateway-class
  listeners:
  - name: https-api
    protocol: HTTPS
    port: 443
    hostname: "api.example.com"
    tls:
      mode: Terminate
      certificateRefs:
      # Cross-namespace Secret reference: requires a ReferenceGrant in "production"
      - name: tls-secret
        namespace: production
    allowedRoutes:
      namespaces:
        from: Selector
        selector:
          # Matches labels on the Namespace object, not on the HTTPRoute
          matchLabels:
            gateway-access: production
---
# Step 3: define the HTTPRoute
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-route
  namespace: production
  labels:
    gateway-access: production
spec:
  parentRefs:
  - name: production-gateway
    namespace: gateway-infra
    sectionName: https-api
  hostnames:
  - "api.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1
    filters:
    - type: RequestMirror
      requestMirror:
        backendRef:
          name: api-v1-canary-service
          port: 8080
    backendRefs:
    - name: api-v1-service
      port: 8080
      weight: 90
    - name: api-v1-canary-service
      port: 8080
      weight: 10
  - matches:
    - path:
        type: PathPrefix
        value: /v2
    backendRefs:
    - name: api-v2-service
      port: 8080
---
# Step 4: configure rate limiting (Envoy Gateway extension)
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: rate-limit-policy
  namespace: production
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: api-route
  rateLimit:
    type: Local
    local:
      rules:
      - limit:
          requests: 100
          unit: Second
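
The 90/10 weights on the /v1 rule of the HTTPRoute determine what fraction of requests each backend receives: a backend's share is its weight divided by the sum of all weights in the rule. A minimal Python sketch of that arithmetic, assuming a missing weight defaults to 1 as in the Gateway API; `traffic_shares` is a hypothetical helper name.

```python
def traffic_shares(backend_refs: list[dict]) -> dict[str, float]:
    """Return each backend's expected fraction of traffic from its weight."""
    total = sum(ref.get("weight", 1) for ref in backend_refs)
    if total == 0:
        # All weights are zero: no backend receives traffic
        return {ref["name"]: 0.0 for ref in backend_refs}
    return {ref["name"]: ref.get("weight", 1) / total for ref in backend_refs}


# backendRefs copied from the /v1 rule of the HTTPRoute
v1_backends = [
    {"name": "api-v1-service", "port": 8080, "weight": 90},
    {"name": "api-v1-canary-service", "port": 8080, "weight": 10},
]

shares = traffic_shares(v1_backends)
print(shares)
```

Shifting the canary forward is then a matter of editing the two weights; they do not need to add up to 100.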

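Migration toolchain

# Official migration tool (standalone ingress2gateway binary from kubernetes-sigs);
# this converts every Ingress in the namespace
ingress2gateway print --providers=ingress-nginx \
  --namespace production \
  > gateway-migration.yaml

# Validate the generated configuration
kubectl apply --dry-run=client -f gateway-migration.yaml

# Gradual migration: run Ingress and Gateway side by side
# 1. Deploy the Gateway (keep the Ingress)
kubectl apply -f gateway-migration.yaml

# 2. Verify Gateway traffic
kubectl get httproute -n production
kubectl logs -n envoy-gateway-system deployment/envoy-gateway  # default install location

# 3. Shift DNS gradually (blue-green)
# 4. Delete the Ingress once error rates are stable
kubectl delete ingress my-app-ingress -n production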

3.4 Envoy Gateway Deployment Architecture

┌─────────────────────────────────────────────────────────┐
│                    Control Plane                         │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐  │
│  │   Gateway   │    │  Envoy      │    │   OIDC      │  │
│  │   Manager   │───▶│ Gateway     │◀───│   Provider  │  │
│  │             │    │ (Infra Mgr) │    │             │  │
│  └─────────────┘    └─────────────┘    └─────────────┘  │
│        │                   │                           │
│        ▼                   ▼                           │
│  ┌─────────────┐    ┌─────────────┐                    │
│  │  Rate Limit │    │   xDS       │                    │
│  │   Service   │    │  Server     │                    │
│  └─────────────┘    └─────────────┘                    │
└─────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────┐
│                    Data Plane                            │
│  ┌─────────────────────────────────────────────────────┐│
│  │                  Envoy Proxy                         ││
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐            ││
│  │  │ Listener│  │  Route  │  │ Cluster │            ││
│  │  │  443    │─▶│ Config  │─▶│ Manager │            ││
│  │  └─────────┘  └─────────┘  └─────────┘            ││
│  │       │            │              │               ││
│  │       ▼            ▼              ▼               ││
│  │  ┌──────────────────────────────────────────────┐ ││
│  │  │              Filters Chain                     │ ││
│  │  │  Rate Limit │ Auth │ WAF │ CORS │ Compression │ ││
│  │  └──────────────────────────────────────────────┘ ││
│  └─────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────┐
│                    Service Mesh                          │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐        │
│  │ API Service│  │ Web Service│  │ ML Service │        │
│  │   :8080    │  │   :80      │  │   :9090    │        │
│  └────────────┘  └────────────┘  └────────────┘        │
└─────────────────────────────────────────────────────────┘

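4. Performance Revolution: SELinux Volume Relabeling Optimization

4.1 Background

On nodes with SELinux enabled (RHEL/CentOS/Fedora), a volume mounted into a Pod must carry the correct SELinux label. The traditional approach walks every file recursively and relabels it; for large volumes (a database data directory, for instance) this can take minutes or longer.

Bottleneck analysis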

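# Traditional relabel flow (illustrative; Linux only)
import os


def traditional_relabel(volume_path, context):
    """
    Walk the tree and set the SELinux label on every entry.
    Time complexity: O(n), where n is the number of files and directories.
    """
    file_count = 0
    total_size = 0

    for root, dirs, files in os.walk(volume_path):
        for name in dirs + files:
            path = os.path.join(root, name)
            # One setxattr system call per entry (this is what setfilecon does)
            os.setxattr(path, "security.selinux", context.encode(), follow_symlinks=False)
            file_count += 1
            total_size += os.lstat(path).st_size

    # Measured: 100 GB volume, about 500,000 files,
    # takes about 180-240 seconds
    return file_count, total_size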

Measured results

Volume size	File count	Traditional relabel time	Pod startup delay
10GB	50,000	18s	+25s
100GB	500,000	180s	+200s
1TB	5,000,000	1800s	Timed out
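
The measurements in the table scale linearly with file count: 18 s for 50,000 files works out to roughly 360 microseconds per file. A small Python sketch of that model, assuming the per-file cost stays constant (a simplification, since the real cost depends on the filesystem and storage backend); the constant and function name are derived for illustration only.

```python
# Derived from the first table row: 18 s / 50,000 files = 360 microseconds per file
PER_FILE_MICROSECONDS = 360


def estimated_relabel_seconds(file_count: int) -> float:
    """Estimate recursive relabel time under a constant per-file cost."""
    return file_count * PER_FILE_MICROSECONDS / 1_000_000


for files in (50_000, 500_000, 5_000_000):
    print(f"{files:>9,} files -> {estimated_relabel_seconds(files):.0f} s")
```

The model reproduces all three rows, which is what an O(n) walk predicts: ten times the files costs ten times the wait.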

4.2 The v1.36 Optimization

The new approach uses the mount -o context= option to apply the SELinux label once at mount time, which reduces the time complexity from O(n) to O(1).

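// Kernel-side mechanism (simplified illustration, not verbatim kernel source;
// the real mount option handling lives in security/selinux/hooks.c)
static int selinux_mount_opt_context(struct fs_context *fc, const char *context)
{
    struct selinux_mnt_opts *opts = fc->security;
    
    // Record the SELinux context on the superblock at mount time.
    // Every file then reports that context, so nothing is processed file by file.
    opts->context = kstrdup(context, GFP_KERNEL);
    if (!opts->context)
        return -ENOMEM;
    
    return 0;
}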

Performance comparison

# Pod startup timeline comparison (illustrative pseudo-YAML, not an API object)
Traditional:
  volumeMount:
    type: persistentVolumeClaim
    name: data-volume
  events:
  - timestamp: 0s
    event: PodScheduled
  - timestamp: 25s
    event: Pulling image
  - timestamp: 30s
    event: Pulled image
  - timestamp: 31s
    event: Creating volume mount  # relabel starts
  - timestamp: 231s               # completes 200 seconds later
    event: Mounted volume
  - timestamp: 232s
    event: Started container

Optimized:
  volumeMount:
    type: persistentVolumeClaim
    name: data-volume
    selinuxMount: true           # optimization enabled
  events:
  - timestamp: 0s
    event: PodScheduled
  - timestamp: 25s
    event: Pulling image
  - timestamp: 30s
    event: Pulled image
  - timestamp: 31s
    event: Creating volume mount  # mount -o context=
  - timestamp: 32s               # completes in 1 second
    event: Mounted volume
  - timestamp: 33s
    event: Started container

4.3 Configuration and Caveats

apiVersion: v1
kind: Pod
metadata:
  name: selinux-optimized-pod
spec:
  securityContext:
    seLinuxOptions:
      level: "s0:c123,c456"
      user: "system_u"
      role: "system_r"
      type: "svirt_lxc_net_t"
    # Key setting: choose how the volume gets its SELinux label
    seLinuxChangePolicy: MountOption  # field on the Pod securityContext
  containers:
  - name: app
    image: postgres:16
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: data
      mountPath: /var/lib/postgresql/data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: postgres-data-pvc
    # There is no per-volume switch in the Pod spec. The CSI driver must opt in
    # by setting seLinuxMount: true on its CSIDriver object.

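Policy options

Value	Behavior	When to use
`MountOption`	Prefer the mount-time optimization	Recommended default
`Recursive`	Fall back to recursive relabeling	Compatibility with older kernels
(unset)	Kubernetes picks one of the two values above based on the enabled feature gates	Clusters that rely on defaults

Important warning: mixed-use scenarios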

# Dangerous configuration (do not do this!)
apiVersion: v1
kind: Pod
metadata:
  name: dangerous-mixed-pod
spec:
  # Privileged container using the optimized mount
  containers:
  - name: privileged-app
    image: admin-tool:latest
    securityContext:
      privileged: true
    volumeMounts:
    - name: shared-data
      mountPath: /data
  
  # Unprivileged container that expects the security label
  - name: unprivileged-app
    image: app:latest
    securityContext:
      runAsNonRoot: true
    volumeMounts:
    - name: shared-data
      mountPath: /app-data
      readOnly: true
  
  volumes:
  - name: shared-data
    persistentVolumeClaim:
      claimName: shared-pvc
    # Problem: the privileged container may create files with a different label
    # that the unprivileged container cannot read

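Security best practices

One volume policy per Pod: avoid mixing usage on a shared volume
Least privilege: grant only the SELinux types that are required
Regular audits: check label consistency with ls -Z
Monitoring and alerting: alert on SELinux denial logs

# Check SELinux labels
kubectl exec -n production postgres-pod -- ls -laZ /var/lib/postgresql/data

# View SELinux denial logs (run on the node; the audit tools are not in the app image)
ausearch -m avc -ts recent

# Generate a custom policy module if needed (also on the node)
audit2allow -a -M postgres_custom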

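5. Dynamic Resource Allocation: AI-Native Scheduling Matures

5.1 DRA Architecture in Depth

DRA (Dynamic Resource Allocation) is the resource management framework Kubernetes designed for AI/ML workloads. Its core idea is to make devices first-class citizens in scheduling decisions.

Architecture layers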

┌────────────────────────────────────────────────────────┐
│                   Kubernetes API                       │
│  ┌──────────────────────────────────────────────────┐  │
│  │  ResourceClaim (requests resources)              │  │
│  │  DeviceClass (device types and allocation policy)│  │
│  │  ResourceSlice (describes devices on a node)     │  │
│  └──────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────┘
                         │
                         ▼
┌────────────────────────────────────────────────────────┐
│                  Scheduler Plugin                      │
│  ┌──────────────────────────────────────────────────┐  │
│  │  Filter: check device availability               │  │
│  │  Score: optimize device placement                │  │
│  │  Reserve: reserve devices                        │  │
│  │  Allocate: bind devices to the Pod               │  │
│  └──────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────┘
                         │
                         ▼
┌────────────────────────────────────────────────────────┐
│                   DRA Driver                           │
│  ┌──────────────────────────────────────────────────┐  │
│  │  Device discovery (publishes ResourceSlices)     │  │
│  │  Device preparation (NodePrepareResources)       │  │
│  │  Device health monitoring                        │  │
│  └──────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────┘
                         │
                         ▼
┌────────────────────────────────────────────────────────┐
│                 Hardware Devices                       │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐        │
│  │    GPU     │  │    FPGA    │  │    NIC     │        │
│  │  NVIDIA    │  │  Intel/AMD │  │  Mellanox  │        │
│  └────────────┘  └────────────┘  └────────────┘        │
└────────────────────────────────────────────────────────┘

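5.2 New in v1.36: Device Taints and Tolerations (Beta)

Device taints are an important feature that reaches Beta in v1.36: a device can be marked as tainted so that only workloads that explicitly tolerate the taint can use it.

Use cases

Scenario	Taint	Effect
Dedicated GPU reservation	`nvidia.com/gpu-reserved=true:NoSchedule`	Only designated teams may use the device
Test device isolation	`device-type=test:NoExecute`	Isolates test devices
Faulty device isolation	`device-state=degraded:NoExecute`	Pods are evicted automatically
Maintenance mode	`maintenance=true:NoSchedule`	Allocation is paused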
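
Device taints use the same key/value/effect vocabulary as node taints. The Python sketch below assumes the matching rules mirror node taint semantics (`Equal` compares values, `Exists` ignores them, and a toleration without an effect matches any effect). It illustrates the logic only and is not the scheduler's code; `tolerates` and `device_usable` are hypothetical names.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if key == "":
        # An empty key is only meaningful with Exists, where it matches every taint
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def device_usable(taints: list[dict], tolerations: list[dict]) -> bool:
    """A device is usable only if every one of its taints is tolerated."""
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in taints)


reserved = {"key": "nvidia.com/gpu-reserved", "value": "ml-team", "effect": "NoSchedule"}
ml_team = {"key": "nvidia.com/gpu-reserved", "operator": "Equal",
           "value": "ml-team", "effect": "NoSchedule"}
web_team = {"key": "nvidia.com/gpu-reserved", "operator": "Equal",
            "value": "web-team", "effect": "NoSchedule"}

print(device_usable([reserved], [ml_team]), device_usable([reserved], [web_team]))
```

An untainted device is usable by everyone, which is why the reservation scenarios in the table rely on the taint being present on the device itself.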

Configuration example

# DeviceClass for the reserved GPUs
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: reserved-gpu-class
spec:
  selectors:
  - cel:
      # Driver and attribute names are driver-specific; check the ResourceSlices
      # that your driver publishes.
      expression: |
        device.driver == "gpu.nvidia.com" &&
        device.attributes["gpu.nvidia.com"].model == "A100"
  config:
  - opaque:
      driver: gpu.nvidia.com
      # Opaque parameters are interpreted only by the driver. This taints block
      # is illustrative; in practice the driver publishes taints in its
      # ResourceSlice, or an admin adds them with a DeviceTaintRule.
      parameters:
        taints:
        - key: "nvidia.com/gpu-reserved"
          value: "ml-team"
          effect: "NoSchedule"
---
# Pod that uses the device
apiVersion: v1
kind: Pod
metadata:
  name: ml-training-pod
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:2.5.0-cuda12.4-cudnn9-runtime
    resources:
      claims:
      - name: gpu-claim
  resourceClaims:
  - name: gpu-claim
    resourceClaimName: reserved-gpu-claim
---
# ResourceClaim: device tolerations live on the request, not on the Pod
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: reserved-gpu-claim
spec:
  devices:
    requests:
    - name: gpu-0
      exactly:
        deviceClassName: reserved-gpu-class
        tolerations:
        - key: "nvidia.com/gpu-reserved"
          operator: "Equal"
          value: "ml-team"
          effect: "NoSchedule"

5.3 Partitionable Device Support (Beta)

This is one of DRA's most transformative features: a single physical device (a GPU, for example) is split into multiple logical units.

MIG (Multi-Instance GPU) partitioning example

# DeviceClass for partitionable GPUs
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: mig-partitioned-gpu
spec:
  selectors:
  - cel:
      expression: device.driver == "gpu.nvidia.com"
  config:
  - opaque:
      driver: gpu.nvidia.com
      # Illustrative driver-specific parameters; the real schema is defined by the driver
      parameters:
        # MIG partition specs
        migStrategy: "mixed"
        partitionSpecs:
        - name: "1g.5gb"      # 1 compute slice, 5 GB memory
          count: 7            # up to 7 instances per A100
        - name: "2g.10gb"     # 2 compute slices, 10 GB memory
          count: 3
        - name: "3g.20gb"     # 3 compute slices, 20 GB memory
          count: 2
---
# Template for a claim on a 5 GB MIG partition (one claim is created per Pod)
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: mig-5gb-claim
spec:
  spec:
    devices:
      requests:
      - name: mig-instance
        exactly:
          deviceClassName: mig-partitioned-gpu
          selectors:
          - cel:
              expression: |
                device.attributes["gpu.nvidia.com"].migProfile == "1g.5gb"
---
# Inference service that uses MIG partitions
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-service
spec:
  replicas: 5  # each Pod gets its own MIG instance
  selector:
    matchLabels:
      app: inference-service
  template:
    metadata:
      labels:
        app: inference-service
    spec:
      containers:
      - name: inference
        image: huggingface/text-generation-inference:latest
        resources:
          claims:
          - name: mig-gpu
      resourceClaims:
      - name: mig-gpu
        resourceClaimTemplateName: mig-5gb-claim

Resource utilization comparison

Configuration	Exclusive (traditional)	MIG partitioning	Pod density gain
A100 80GB	1 Pod/GPU	7 Pods/GPU	600%
H100 80GB	1 Pod/GPU	7 Pods/GPU	600%
Cost savings	Baseline	-70%	Significant
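
The instance counts in the MIG example follow from simple slice arithmetic. The Python sketch below assumes a GPU exposes 7 compute slices and that the leading number in a MIG profile name is the number of compute slices it consumes. It ignores the memory-slice and placement constraints that real hardware also enforces, and the helper names are hypothetical.

```python
COMPUTE_SLICES_PER_GPU = 7  # assumption: MIG-capable GPU with 7 compute slices


def compute_slices(profile: str) -> int:
    """Extract the compute slice count from a profile name such as '2g.10gb'."""
    return int(profile.split("g.", 1)[0])


def max_instances(profile: str, total_slices: int = COMPUTE_SLICES_PER_GPU) -> int:
    """How many instances of one profile fit on a single GPU by compute slices alone."""
    return total_slices // compute_slices(profile)


def density_gain_percent(pods_per_gpu: int) -> int:
    """Gain relative to the exclusive baseline of one Pod per GPU."""
    return (pods_per_gpu - 1) * 100


for profile in ("1g.5gb", "2g.10gb", "3g.20gb"):
    print(profile, max_instances(profile))
print(density_gain_percent(max_instances("1g.5gb")))
```

The results match the `count` values in the DeviceClass example (7, 3, and 2), and seven Pods on one GPU is where the 600% figure comes from.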

5.4 Production-Grade GPU Workload Example

# Complete ML training workload configuration
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training-job
  namespace: ml-team
spec:
  completionMode: Indexed  # required for the job-completion-index annotation used by RANK
  parallelism: 4
  completions: 4
  template:
    metadata:
      labels:
        app: distributed-training
    spec:
      restartPolicy: OnFailure  # Job Pods must use Never or OnFailure
      serviceAccountName: training-sa
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: trainer
        image: pytorch/pytorch:2.5.0-cuda12.4-cudnn9-devel
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
            - ALL
        command:
        - python3
        - /app/train.py
        - --backend=nccl
        - --world-size=4
        env:
        - name: MASTER_ADDR
          value: "training-master-0.training-master-headless.ml-team.svc.cluster.local"
        - name: MASTER_PORT
          value: "29500"
        - name: RANK
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        - name: NCCL_DEBUG
          value: "INFO"
        - name: NCCL_SOCKET_IFNAME
          value: "eth0"
        resources:
          claims:
          - name: gpu-claim
          - name: rdma-claim
        volumeMounts:
        - name: training-data
          mountPath: /data
          readOnly: true
        - name: model-checkpoints
          mountPath: /checkpoints
        - name: dshm
          mountPath: /dev/shm
      volumes:
      - name: training-data
        persistentVolumeClaim:
          claimName: training-data-pvc
      - name: model-checkpoints
        persistentVolumeClaim:
          claimName: checkpoints-pvc
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: "16Gi"
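      resourceClaims:
      # A Pod references claim templates by name. This assumes ResourceClaimTemplate
      # objects with these (hypothetical) names exist in the ml-team namespace and
      # request deviceClassName mig-partitioned-gpu and rdma-nic-class respectively.
      - name: gpu-claim
        resourceClaimTemplateName: gpu-claim-template
      - name: rdma-claim
        resourceClaimTemplateName: rdma-claim-template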
      nodeSelector:
        accelerator: nvidia-a100
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      - key: "node.kubernetes.io/rdma"
        operator: "Exists"
        effect: "NoSchedule"
---
# DeviceClass for RDMA NICs
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: rdma-nic-class
spec:
  selectors:
  - cel:
      expression: device.driver == "rdma.resource.k8s.io"
  config:
  - opaque:
      driver: rdma.resource.k8s.io
      parameters:
        deviceType: "ConnectX-6"
        maxIbDevices: 1
        enableGdr: true

6. Production Upgrade Checklist

6.1 Pre-Upgrade Audit Script

#!/bin/bash
# k8s-v136-pre-upgrade-check.sh
# Full pre-upgrade check for Kubernetes v1.36

set -e

RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'

echo "========================================"
echo "Kubernetes v1.36 pre-upgrade check"
echo "========================================"
echo ""

# Check 1: gitRepo volume usage
echo -e "${YELLOW}[Check 1/8] gitRepo volume usage${NC}"
# any(...) emits each Pod once, even if it has several gitRepo volumes
GITREPO_PODS=$(kubectl get pods --all-namespaces -o json | jq -r '.items[] | select(any(.spec.volumes[]?; .gitRepo != null)) | .metadata.namespace + "/" + .metadata.name')
if [ -n "$GITREPO_PODS" ]; then
    GITPOD_COUNT=$(echo "$GITREPO_PODS" | wc -l)
    echo -e "${RED}✗ Found $GITPOD_COUNT Pod(s) using gitRepo volumes${NC}"
    echo "$GITREPO_PODS"
    echo ""
    echo "Migration advice: use initContainer + git-sync or a ConfigMap"
else
    echo -e "${GREEN}✓ No gitRepo volumes in use${NC}"
fi
echo ""

# Check 2: externalIPs usage
echo -e "${YELLOW}[Check 2/8] externalIPs usage${NC}"
EXTERNAL_IP_SVCS=$(kubectl get services --all-namespaces -o json | jq -r '.items[] | select(.spec.externalIPs != null and .spec.externalIPs != []) | .metadata.namespace + "/" + .metadata.name')
if [ -n "$EXTERNAL_IP_SVCS" ]; then
    echo -e "${RED}✗ Found Services using externalIPs:${NC}"
    echo "$EXTERNAL_IP_SVCS"
    echo ""
    echo "Migration advice: use a LoadBalancer Service or Gateway API"
else
    echo -e "${GREEN}✓ No externalIPs in use${NC}"
fi
echo ""

# Check 3: Ingress NGINX deployments
echo -e "${YELLOW}[Check 3/8] Ingress NGINX deployments${NC}"
INGRESS_NGINX=$(kubectl get deployments --all-namespaces -l app.kubernetes.io/name=ingress-nginx -o json 2>/dev/null | jq -r '.items[] | .metadata.namespace + "/" + .metadata.name' || echo "")
if [ -n "$INGRESS_NGINX" ]; then
    echo -e "${YELLOW}⚠ Ingress NGINX is retired; migrating to Gateway API is recommended${NC}"
    echo "Current deployments:"
    echo "$INGRESS_NGINX"
    echo ""
    echo "Migration tool: ingress2gateway print --providers=ingress-nginx"
else
    echo -e "${GREEN}✓ No Ingress NGINX deployments${NC}"
fi
echo ""

# Check 4: SELinux status
echo -e "${YELLOW}[Check 4/8] Node SELinux status${NC}"
for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
    echo "Node: $node"
    # Assumes SSH access to the node
    # ssh $node "getenforce" || echo "Unable to check"
done
echo ""

# Check 5: DRA driver compatibility
echo -e "${YELLOW}[Check 5/8] DRA driver compatibility${NC}"
DRA_DRIVERS=$(kubectl get deviceclasses -A -o json 2>/dev/null | jq -r '.items[].metadata.name' || echo "")
if [ -n "$DRA_DRIVERS" ]; then
    echo -e "${GREEN}✓ DRA device classes detected:${NC}"
    echo "$DRA_DRIVERS"
else
    echo "No DRA device classes detected (DRA may be disabled, or there are no GPU workloads)"
fi
echo ""

# Check 6: deprecated API usage
echo -e "${YELLOW}[Check 6/8] Deprecated API usage${NC}"
# The API server exports a gauge for every deprecated API that has been requested
DEPRECATED_APIS=$(kubectl get --raw /metrics 2>/dev/null | grep '^apiserver_requested_deprecated_apis' || echo "")
if [ -n "$DEPRECATED_APIS" ]; then
    echo -e "${YELLOW}⚠ Deprecated APIs have been requested:${NC}"
    echo "$DEPRECATED_APIS"
else
    echo -e "${GREEN}✓ No deprecated API warnings${NC}"
fi
echo ""

# Check 7: privileged Pod audit
echo -e "${YELLOW}[Check 7/8] Pod security audit${NC}"
# privileged is a container-level field; any(...) emits each Pod once
PRIVILEGED_PODS=$(kubectl get pods --all-namespaces -o json | jq -r '.items[] | select(any(.spec.containers[]; .securityContext.privileged == true)) | .metadata.namespace + "/" + .metadata.name' | head -10)
if [ -n "$PRIVILEGED_PODS" ]; then
    echo -e "${YELLOW}⚠ Privileged Pods found (first 10):${NC}"
    echo "$PRIVILEGED_PODS"
else
    echo -e "${GREEN}✓ No privileged Pods${NC}"
fi
echo ""

# Check 8: resource quotas and limits
echo -e "${YELLOW}[Check 8/8] Resource quota audit${NC}"
LIMIT_RANGES=$(kubectl get limitranges --all-namespaces -o json | jq -r '.items[] | select(.spec.limits[]?.type == "Pod") | .metadata.namespace + "/" + .metadata.name')
if [ -n "$LIMIT_RANGES" ]; then
    echo -e "${GREEN}✓ Namespaces with a LimitRange configured:${NC}"
    echo "$LIMIT_RANGES" | head -10
else
    echo -e "${YELLOW}⚠ Some namespaces have no LimitRange configured${NC}"
fi
echo ""

# Version check
echo "========================================"
echo "Current cluster version:"
kubectl version --short 2>/dev/null || kubectl version
echo "========================================"
echo ""
echo "Upgrade advice:"
echo "1. Back up etcd data"
echo "2. Upgrade the control plane first, then the worker nodes"
echo "3. Verify that workloads are healthy"
echo "4. Migrate Ingress NGINX → Gateway API"
echo "5. Address the gitRepo removal and the externalIPs deprecation"
echo ""

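6.2 Phased Upgrade Plan

Phase 1: Test environment validation (2-4 weeks)
├── Deploy a v1.36 test cluster
├── Run k8s-v136-pre-upgrade-check.sh
├── Migrate gitRepo workloads
├── Test Gateway API replacements
└── Verify DRA functionality (if applicable)

Phase 2: Pre-production validation (1-2 weeks)
├── Clone production data into pre-production
├── Run the full upgrade procedure
├── Performance benchmarks
├── Security audit
└── Disaster recovery drill

Phase 3: Production upgrade (rolling by availability zone)
├── Upgrade the control plane (one control plane node at a time)
├── Upgrade node pools (10% → 50% → 100%)
├── Verify monitoring and alerting
├── Application-layer migration (Ingress → Gateway)
└── Clean up obsolete configuration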
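
The 10% → 50% → 100% node pool rollout can be turned into concrete batches. A minimal Python sketch, assuming the percentages are cumulative and rounding up so that the first wave always contains at least one node; `rollout_batches` and the node names are hypothetical.

```python
def rollout_batches(nodes: list[str], cumulative_percent=(10, 50, 100)) -> list[list[str]]:
    """Split nodes into upgrade waves that reach each cumulative percentage in turn."""
    batches = []
    done = 0
    for percent in cumulative_percent:
        # Ceiling of len(nodes) * percent / 100 using integer arithmetic
        target = min(len(nodes), (len(nodes) * percent + 99) // 100)
        if target > done:
            batches.append(nodes[done:target])
            done = target
    return batches


node_pool = [f"node-{i:02d}" for i in range(1, 21)]  # hypothetical 20-node pool
waves = rollout_batches(node_pool)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {len(wave)} nodes")
```

For a 20-node pool this yields waves of 2, 8, and 10 nodes. Pausing between waves to check the monitoring and alerting step keeps a bad upgrade from reaching the whole pool.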

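7. Summary and Outlook

Kubernetes v1.36 is a bridge release between two eras:

Security

Removing gitRepo eliminates a node-level code execution risk
Deprecating externalIPs shrinks the man-in-the-middle attack surface
External ServiceAccount token signing supports zero-trust architectures

Performance

SELinux mount-time labeling cuts the volume mount step from about 200 seconds to about 1 second in the 100 GB example, which shortens total Pod startup by roughly 86% (232 s → 33 s)
DRA partitioning raises Pod density per GPU by up to 600% (7 Pods instead of 1)

Architecture

Gateway API becomes the standard for inbound traffic management
DRA makes devices first-class citizens

My advice for teams operating production clusters:

Do not rush the upgrade: wait 2-4 weeks and watch community feedback
Handle removed and retired features first: the gitRepo and Ingress NGINX migrations cost the most
Embrace Gateway API: the question is not whether, but when
Follow DRA: if your workloads use GPUs, DRA will change how you manage resources

Kubernetes is evolving into an AI-native container orchestration platform, and v1.36 is a key step in that transition.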
