Where I Got Stuck Installing Kubeflow 1.1

Tetsuya Isogai
32 min read · Sep 21, 2020

TL;DR

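I have to touch Kubeflow about once a quarter, and every time I get stuck so badly that it feels like I've stepped in every pothole there is. I eventually get it installed, and then the next time I try, I get stuck all over again. I keep repeating this cycle, so this time I'm going to take proper notes.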

What I'm installing

  • Kubeflow 1.1 (see the end of this post for the detailed versions).
$ kfctl version
kfctl v1.1.0-0-g9a3621e

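Installation approach

This time I needed Kubeflow 1.1, but all I had on hand was 0.7. I didn't want to install over it, because if the 1.1 install failed I would be left with no Kubeflow at all. So I settled on the following procedure:
1. Clone the Kubeflow 0.7 VM
2. Remove 0.7 from the cloned VM
3. Install 1.1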
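Steps 2 and 3 can be sketched roughly as follows. This is only a sketch under assumptions: `OLD_KFDEF` and `NEW_KFDEF` are hypothetical placeholders for the KfDef file used for the 0.7 deployment and the one chosen for 1.1 (this post does not record which files were used), and `DRY_RUN` is a helper variable added here so the commands can be reviewed before anything is deleted.

```shell
#!/bin/sh
# Sketch only. OLD_KFDEF / NEW_KFDEF are hypothetical placeholders for
# your own KfDef files; they are not taken from the original post.
OLD_KFDEF="${OLD_KFDEF:-./kfctl_old.yaml}"
NEW_KFDEF="${NEW_KFDEF:-./kfctl_new.yaml}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 2: remove the old deployment. Step 3: apply the new one.
reinstall_kubeflow() {
  run kfctl delete -V -f "$OLD_KFDEF"
  run kfctl apply -V -f "$NEW_KFDEF"
}
```

Calling `reinstall_kubeflow` with the default `DRY_RUN=1` only prints the two `kfctl` commands. Set `DRY_RUN=0` on the cloned VM once the paths look right.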

Environment

A single-node cluster, with the one node acting as both master and worker. Sorry for the wimpy setup.

OS: Ubuntu 18.04.3 LTS (Bionic Beaver)
Kubernetes: v1.15.7

Where I got stuck

PV

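I trip over this every time. How many times has it been now?
This time I tried this. It exposes a node's local disk as a StorageClass with dynamic provisioning, so it is quite handy in cases like this one, where setting up proper external storage is a hassle but you still want dynamic provisioning.

If needed, make it the default StorageClass.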

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
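If the StorageClass already exists, the same annotation can be set with `kubectl patch` instead of editing the manifest. A minimal sketch: the function names are made up for illustration, and the StorageClass name you pass in is whatever `kubectl get storageclass` shows on your cluster.

```shell
#!/bin/sh
# Build the JSON patch that sets (or clears) the default-class annotation.
# $1: "true" or "false"
default_sc_patch() {
  printf '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"%s"}}}' "$1"
}

# Mark an existing StorageClass as the default.
# $1: StorageClass name (placeholder; use the name from `kubectl get storageclass`)
set_default_storageclass() {
  kubectl patch storageclass "$1" -p "$(default_sc_patch true)"
}
```

After running `set_default_storageclass <name>`, `kubectl get storageclass` should show `(default)` next to that class.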

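Insufficient resources

The original 0.7 environment ran on 4 vCPUs, but this time that was not enough and things would not start. Tuning was too much trouble, so I raised it to 12 vCPUs. The final resource consumption came out to about this much. (To repeat: this is with a single node acting as both master and worker.)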

$ kubectl describe node
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 5520m (46%) 45900m (382%)
memory 7408896512 (22%) 30141130Ki (91%)
ephemeral-storage 0 (0%) 0 (0%)
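The full `kubectl describe node` output is long, so a small filter helps when you only want this block. A minimal sketch: `allocated_resources` is a helper name made up here, and it assumes the usual layout where section headers start at column 0 and their contents are indented.

```shell
#!/bin/sh
# Print only the "Allocated resources:" section from `kubectl describe node`
# output read on stdin. The section ends at the next non-indented line.
allocated_resources() {
  awk '
    /^Allocated resources:/ { show = 1; print; next }
    /^[^ \t]/               { show = 0 }
    show
  '
}

# Usage:
#   kubectl describe node | allocated_resources
```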

Error: istio-token cannot be mounted

MountVolume.SetUp failed for volume "istio-token" : failed to fetch token: the server could not find the requested resource

A mysterious error. The following resolved it.

https://github.com/kubeflow/manifests/issues/959
https://github.com/kubeflow/manifests/issues/959#issuecomment-593289634
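The linked comment is the authoritative fix. My understanding, stated here as an assumption rather than a quote from the issue, is that the `istio-token` volume is a projected service account token, and the API server only issues those when it is started with `--service-account-issuer` and `--service-account-signing-key-file`. The sketch below just checks whether both flags are present. The manifest path is the kubeadm default and is also an assumption about your cluster.

```shell
#!/bin/sh
# Assumption: kubeadm-style static pod manifest for the API server.
APISERVER_MANIFEST="${APISERVER_MANIFEST:-/etc/kubernetes/manifests/kube-apiserver.yaml}"

# Succeeds (exit status 0) only if both token-projection flags appear
# in the given manifest. $1: manifest path (optional).
has_token_projection_flags() {
  manifest="${1:-$APISERVER_MANIFEST}"
  grep -q -- '--service-account-issuer=' "$manifest" &&
    grep -q -- '--service-account-signing-key-file=' "$manifest"
}

# Usage:
#   has_token_projection_flags && echo "flags present" || echo "flags missing"
```

If the flags are missing, follow the issue comment for the exact values to add. The kubelet restarts the API server static pod when the manifest changes.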

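GUI error

The Kubeflow dashboard is not reached through a Service in the kubeflow namespace. It goes through service/istio-ingressgateway in the istio-system namespace. Access it on the node port mapped to port 80 in the output below (31380 here; I am not sure whether "the proxy port for 80" is the right way to put it).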

$ kubectl get svc -n istio-system istio-ingressgateway
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
istio-ingressgateway NodePort 10.105.148.54 <none> 15020:31842/TCP,80:31380/TCP,443:31390/TCP,31400:31400/TCP,15029:31654/TCP,15030:30459/TCP,15031:30005/TCP,15032:31611/TCP,15443:32149/TCP 2d20h

https://github.com/kubeflow/kubeflow/issues/3615
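Picking the right number out of that long PORT(S) column by eye is error-prone, so here is a small sketch that extracts it. `nodeport_for` is a helper name made up for this post. It only parses the `port:nodePort/PROTO` text and does not talk to the cluster.

```shell
#!/bin/sh
# Print the node port mapped to a given service port.
# $1: the PORT(S) column, e.g. "15020:31842/TCP,80:31380/TCP"
# $2: the service port to look up, e.g. 80
nodeport_for() {
  printf '%s\n' "$1" | tr ',' '\n' |
    awk -F'[:/]' -v p="$2" '$1 == p { print $2 }'
}

# Usage against the live cluster, letting kubectl do the lookup instead:
#   kubectl get svc -n istio-system istio-ingressgateway \
#     -o jsonpath='{.spec.ports[?(@.port==80)].nodePort}'
```

With the output above this gives 31380, so the dashboard is at `http://<node IP>:31380/`.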

A small tip

You will type -n kubeflow about a thousand times, so setting the namespace with set-context is handy.

$ kubectl config set-context --current --namespace=kubeflow

The end result

$ kubectl get all -n kubeflow
NAME READY STATUS RESTARTS AGE
pod/admission-webhook-bootstrap-stateful-set-0 1/1 Running 4 2d20h
pod/admission-webhook-deployment-795bb748-pxcwx 1/1 Running 0 45h
pod/application-controller-stateful-set-0 1/1 Running 2 2d20h
pod/argo-ui-657d964995-8t4vg 1/1 Running 2 2d20h
pod/cache-deployer-deployment-867cf86c64-cjnxv 2/2 Running 3 2d20h
pod/cache-server-65596854d-wb42d 2/2 Running 0 2d20h
pod/centraldashboard-54c547bd7f-d2c42 1/1 Running 6 2d20h
pod/jupyter-web-app-deployment-56dc859fdd-l2gqn 1/1 Running 2 2d20h
pod/katib-controller-6fc96fddf8-xxpxv 1/1 Running 3 2d20h
pod/katib-db-manager-78d458db46-gpnqc 1/1 Running 257 2d20h
pod/katib-mysql-7f9cfccb98-45zxr 1/1 Running 2 2d20h
pod/katib-ui-74768457d5-8cvx5 1/1 Running 2 2d20h
pod/kfserving-controller-manager-0 2/2 Running 2 2d20h
pod/kubeflow-pipelines-profile-controller-588884d9bb-dk8jz 1/1 Running 2 2d20h
pod/metacontroller-0 1/1 Running 2 2d20h
pod/metadata-db-7fc598bbb5-kfr7b 1/1 Running 1 2d20h
pod/metadata-deployment-7578c6bc46-4wzbs 1/1 Running 497 2d20h
pod/metadata-envoy-deployment-75df6688bb-vx9w8 1/1 Running 2 2d20h
pod/metadata-grpc-deployment-76d44cfd88-czl2c 1/1 Running 222 2d20h
pod/metadata-ui-794f6dcc5b-7nw5b 1/1 Running 2 2d20h
pod/metadata-writer-694c48ccdc-qmvc5 2/2 Running 0 2d20h
pod/minio-655ddb4d95-ccqsx 1/1 Running 1 2d20h
pod/ml-pipeline-5df444d46d-65rgq 2/2 Running 0 2d20h
pod/ml-pipeline-persistenceagent-9f5c875d-dxvpp 2/2 Running 0 2d20h
pod/ml-pipeline-scheduledworkflow-768c4d65d4-gltdl 2/2 Running 0 2d20h
pod/ml-pipeline-ui-8589d58598-tcffh 2/2 Running 0 2d20h
pod/ml-pipeline-viewer-crd-5dd6cc5f56-wsj78 2/2 Running 1 2d20h
pod/ml-pipeline-visualizationserver-9b67b8b68-6cq76 2/2 Running 0 2d20h
pod/mpi-operator-55457d5f54-5f74v 1/1 Running 5 2d20h
pod/mxnet-operator-68bf5b4fbc-gdnc2 1/1 Running 4 2d20h
pod/mysql-56f64cfcc-z2kgq 2/2 Running 0 45h
pod/notebook-controller-deployment-6f789d748-5wbcv 1/1 Running 2 2d20h
pod/profiles-deployment-6fffd9c9-fwbt8 2/2 Running 4 2d20h
pod/pytorch-operator-d449c769b-hqm55 1/1 Running 9 2d20h
pod/seldon-controller-manager-68f9f7bff6-jkb57 1/1 Running 5 2d20h
pod/spark-operatorsparkoperator-758795c89b-vbrhf 1/1 Running 2 2d20h
pod/spartakus-volunteer-69f5b89c96-njknm 1/1 Running 2 2d20h
pod/tf-job-operator-644f847f5c-2844p 1/1 Running 9 2d20h
pod/workflow-controller-dd8985f4d-qxh8m 1/1 Running 2 2d20h

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/admission-webhook-service ClusterIP 10.106.24.104 <none> 443/TCP 2d20h
service/application-controller-service ClusterIP 10.97.13.119 <none> 443/TCP 2d20h
service/argo-ui NodePort 10.110.193.72 <none> 80:32065/TCP 2d20h
service/cache-server ClusterIP 10.98.200.153 <none> 443/TCP 2d20h
service/centraldashboard ClusterIP 10.110.2.44 <none> 80/TCP 2d20h
service/jupyter-web-app-service ClusterIP 10.97.229.226 <none> 80/TCP 2d20h
service/katib-controller ClusterIP 10.106.123.248 <none> 443/TCP,8080/TCP 2d20h
service/katib-db-manager ClusterIP 10.107.254.8 <none> 6789/TCP 2d20h
service/katib-mysql ClusterIP 10.108.229.228 <none> 3306/TCP 2d20h
service/katib-ui ClusterIP 10.108.163.144 <none> 80/TCP 2d20h
service/kfserving-controller-manager-metrics-service ClusterIP 10.107.37.169 <none> 8443/TCP 2d20h
service/kfserving-controller-manager-service ClusterIP 10.98.195.250 <none> 443/TCP 2d20h
service/kfserving-webhook-server-service ClusterIP 10.106.79.84 <none> 443/TCP 2d20h
service/kubeflow-pipelines-profile-controller ClusterIP 10.109.47.5 <none> 80/TCP 2d20h
service/metadata-db ClusterIP 10.99.251.151 <none> 3306/TCP 2d20h
service/metadata-envoy-service ClusterIP 10.100.48.115 <none> 9090/TCP 2d20h
service/metadata-grpc-service ClusterIP 10.100.33.121 <none> 8080/TCP 2d20h
service/metadata-service ClusterIP 10.97.165.97 <none> 8080/TCP 2d20h
service/metadata-ui ClusterIP 10.97.253.2 <none> 80/TCP 2d20h
service/minio-service ClusterIP 10.110.118.90 <none> 9000/TCP 2d20h
service/ml-pipeline ClusterIP 10.96.66.86 <none> 8888/TCP,8887/TCP 2d20h
service/ml-pipeline-ui ClusterIP 10.103.33.58 <none> 80/TCP 2d20h
service/ml-pipeline-visualizationserver ClusterIP 10.98.43.116 <none> 8888/TCP 2d20h
service/mysql ClusterIP 10.97.209.58 <none> 3306/TCP 2d20h
service/notebook-controller-service ClusterIP 10.110.5.82 <none> 443/TCP 2d20h
service/profiles-kfam ClusterIP 10.106.127.68 <none> 8081/TCP 2d20h
service/pytorch-operator ClusterIP 10.105.224.245 <none> 8443/TCP 2d20h
service/seldon-webhook-service ClusterIP 10.100.237.108 <none> 443/TCP 2d20h
service/tf-job-operator ClusterIP 10.108.121.94 <none> 8443/TCP 2d20h

NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/admission-webhook-deployment 1/1 1 1 2d20h
deployment.apps/argo-ui 1/1 1 1 2d20h
deployment.apps/cache-deployer-deployment 1/1 1 1 2d20h
deployment.apps/cache-server 1/1 1 1 2d20h
deployment.apps/centraldashboard 1/1 1 1 2d20h
deployment.apps/jupyter-web-app-deployment 1/1 1 1 2d20h
deployment.apps/katib-controller 1/1 1 1 2d20h
deployment.apps/katib-db-manager 1/1 1 1 2d20h
deployment.apps/katib-mysql 1/1 1 1 2d20h
deployment.apps/katib-ui 1/1 1 1 2d20h
deployment.apps/kubeflow-pipelines-profile-controller 1/1 1 1 2d20h
deployment.apps/metadata-db 1/1 1 1 2d20h
deployment.apps/metadata-deployment 1/1 1 1 2d20h
deployment.apps/metadata-envoy-deployment 1/1 1 1 2d20h
deployment.apps/metadata-grpc-deployment 1/1 1 1 2d20h
deployment.apps/metadata-ui 1/1 1 1 2d20h
deployment.apps/metadata-writer 1/1 1 1 2d20h
deployment.apps/minio 1/1 1 1 2d20h
deployment.apps/ml-pipeline 1/1 1 1 2d20h
deployment.apps/ml-pipeline-persistenceagent 1/1 1 1 2d20h
deployment.apps/ml-pipeline-scheduledworkflow 1/1 1 1 2d20h
deployment.apps/ml-pipeline-ui 1/1 1 1 2d20h
deployment.apps/ml-pipeline-viewer-crd 1/1 1 1 2d20h
deployment.apps/ml-pipeline-visualizationserver 1/1 1 1 2d20h
deployment.apps/mpi-operator 1/1 1 1 2d20h
deployment.apps/mxnet-operator 1/1 1 1 2d20h
deployment.apps/mysql 1/1 1 1 2d20h
deployment.apps/notebook-controller-deployment 1/1 1 1 2d20h
deployment.apps/profiles-deployment 1/1 1 1 2d20h
deployment.apps/pytorch-operator 1/1 1 1 2d20h
deployment.apps/seldon-controller-manager 1/1 1 1 2d20h
deployment.apps/spark-operatorsparkoperator 1/1 1 1 2d20h
deployment.apps/spartakus-volunteer 1/1 1 1 2d20h
deployment.apps/tf-job-operator 1/1 1 1 2d20h
deployment.apps/workflow-controller 1/1 1 1 2d20h

NAME DESIRED CURRENT READY AGE
replicaset.apps/admission-webhook-deployment-795bb748 1 1 1 2d20h
replicaset.apps/argo-ui-657d964995 1 1 1 2d20h
replicaset.apps/cache-deployer-deployment-867cf86c64 1 1 1 2d20h
replicaset.apps/cache-server-65596854d 1 1 1 2d20h
replicaset.apps/centraldashboard-54c547bd7f 1 1 1 2d20h
replicaset.apps/jupyter-web-app-deployment-56dc859fdd 1 1 1 2d20h
replicaset.apps/katib-controller-6fc96fddf8 1 1 1 2d20h
replicaset.apps/katib-db-manager-78d458db46 1 1 1 2d20h
replicaset.apps/katib-mysql-7f9cfccb98 1 1 1 2d20h
replicaset.apps/katib-ui-74768457d5 1 1 1 2d20h
replicaset.apps/kubeflow-pipelines-profile-controller-588884d9bb 1 1 1 2d20h
replicaset.apps/metadata-db-7fc598bbb5 1 1 1 2d20h
replicaset.apps/metadata-deployment-7578c6bc46 1 1 1 2d20h
replicaset.apps/metadata-envoy-deployment-75df6688bb 1 1 1 2d20h
replicaset.apps/metadata-grpc-deployment-76d44cfd88 1 1 1 2d20h
replicaset.apps/metadata-ui-794f6dcc5b 1 1 1 2d20h
replicaset.apps/metadata-writer-694c48ccdc 1 1 1 2d20h
replicaset.apps/minio-655ddb4d95 1 1 1 2d20h
replicaset.apps/ml-pipeline-5df444d46d 1 1 1 2d20h
replicaset.apps/ml-pipeline-persistenceagent-9f5c875d 1 1 1 2d20h
replicaset.apps/ml-pipeline-scheduledworkflow-768c4d65d4 1 1 1 2d20h
replicaset.apps/ml-pipeline-ui-8589d58598 1 1 1 2d20h
replicaset.apps/ml-pipeline-viewer-crd-5dd6cc5f56 1 1 1 2d20h
replicaset.apps/ml-pipeline-visualizationserver-9b67b8b68 1 1 1 2d20h
replicaset.apps/mpi-operator-55457d5f54 1 1 1 2d20h
replicaset.apps/mxnet-operator-68bf5b4fbc 1 1 1 2d20h
replicaset.apps/mysql-56f64cfcc 1 1 1 2d20h
replicaset.apps/notebook-controller-deployment-6f789d748 1 1 1 2d20h
replicaset.apps/profiles-deployment-6fffd9c9 1 1 1 2d20h
replicaset.apps/pytorch-operator-d449c769b 1 1 1 2d20h
replicaset.apps/seldon-controller-manager-68f9f7bff6 1 1 1 2d20h
replicaset.apps/spark-operatorsparkoperator-758795c89b 1 1 1 2d20h
replicaset.apps/spartakus-volunteer-69f5b89c96 1 1 1 2d20h
replicaset.apps/tf-job-operator-644f847f5c 1 1 1 2d20h
replicaset.apps/workflow-controller-dd8985f4d 1 1 1 2d20h

NAME READY AGE
statefulset.apps/admission-webhook-bootstrap-stateful-set 1/1 2d20h
statefulset.apps/application-controller-stateful-set 1/1 2d20h
statefulset.apps/kfserving-controller-manager 1/1 2d20h
statefulset.apps/metacontroller 1/1 2d20h

Bonus: detailed versions

$ kubectl get pods -n kubeflow -o=custom-columns='NAME:.metadata.name,DATA:spec.containers[*].image'
NAME DATA
admission-webhook-bootstrap-stateful-set-0 gcr.io/kubeflow-images-public/ingress-setup:latest
admission-webhook-deployment-795bb748-pxcwx gcr.io/kubeflow-images-public/admission-webhook:vmaster-gaf96e4e3
application-controller-stateful-set-0 gcr.io/kubeflow-images-public/kubernetes-sigs/application:1.0-beta
argo-ui-657d964995-8t4vg argoproj/argoui:v2.3.0
cache-deployer-deployment-867cf86c64-cjnxv gcr.io/ml-pipeline/cache-deployer:1.0.0,gcr.io/istio-release/proxyv2:release-1.3-latest-daily
cache-server-65596854d-wb42d gcr.io/ml-pipeline/cache-server:1.0.0,gcr.io/istio-release/proxyv2:release-1.3-latest-daily
centraldashboard-54c547bd7f-d2c42 gcr.io/kubeflow-images-public/centraldashboard:vmaster-gf39279c0
jupyter-web-app-deployment-56dc859fdd-l2gqn gcr.io/kubeflow-images-public/jupyter-web-app:vmaster-gd9be4b9e
katib-controller-6fc96fddf8-xxpxv gcr.io/kubeflow-images-public/katib/v1alpha3/katib-controller:917164a
katib-db-manager-78d458db46-gpnqc gcr.io/kubeflow-images-public/katib/v1alpha3/katib-db-manager:917164a
katib-mysql-7f9cfccb98-45zxr mysql:8
katib-ui-74768457d5-8cvx5 gcr.io/kubeflow-images-public/katib/v1alpha3/katib-ui:917164a
kfserving-controller-manager-0 gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0,gcr.io/kfserving/kfserving-controller:v0.3.0
kubeflow-pipelines-profile-controller-588884d9bb-dk8jz python:3.7
metacontroller-0 metacontroller/metacontroller:v0.3.0
metadata-db-7fc598bbb5-kfr7b mysql:8.0.3
metadata-deployment-7578c6bc46-4wzbs gcr.io/kubeflow-images-public/metadata:v0.1.11
metadata-envoy-deployment-75df6688bb-vx9w8 gcr.io/ml-pipeline/envoy:metadata-grpc
metadata-grpc-deployment-76d44cfd88-czl2c gcr.io/tfx-oss-public/ml_metadata_store_server:v0.21.1
metadata-ui-794f6dcc5b-7nw5b gcr.io/kubeflow-images-public/metadata-frontend:v0.1.8
metadata-writer-694c48ccdc-qmvc5 gcr.io/ml-pipeline/metadata-writer:1.0.0,gcr.io/istio-release/proxyv2:release-1.3-latest-daily
minio-655ddb4d95-ccqsx gcr.io/ml-pipeline/minio:RELEASE.2019-08-14T20-37-41Z-license-compliance
ml-pipeline-5df444d46d-65rgq gcr.io/ml-pipeline/api-server:1.0.0,gcr.io/istio-release/proxyv2:release-1.3-latest-daily
ml-pipeline-persistenceagent-9f5c875d-dxvpp gcr.io/ml-pipeline/persistenceagent:1.0.0,gcr.io/istio-release/proxyv2:release-1.3-latest-daily
ml-pipeline-scheduledworkflow-768c4d65d4-gltdl gcr.io/ml-pipeline/scheduledworkflow:1.0.0,gcr.io/istio-release/proxyv2:release-1.3-latest-daily
ml-pipeline-ui-8589d58598-tcffh gcr.io/ml-pipeline/frontend:1.0.0,gcr.io/istio-release/proxyv2:release-1.3-latest-daily
ml-pipeline-viewer-crd-5dd6cc5f56-wsj78 gcr.io/ml-pipeline/viewer-crd-controller:1.0.0,gcr.io/istio-release/proxyv2:release-1.3-latest-daily
ml-pipeline-visualizationserver-9b67b8b68-6cq76 gcr.io/ml-pipeline/visualization-server:1.0.0,gcr.io/istio-release/proxyv2:release-1.3-latest-daily
mpi-operator-55457d5f54-5f74v mpioperator/mpi-operator:latest
mxnet-operator-68bf5b4fbc-gdnc2 kubeflow/mxnet-operator:v1.0.0-20200625
mysql-56f64cfcc-z2kgq gcr.io/ml-pipeline/mysql:5.6,gcr.io/istio-release/proxyv2:release-1.3-latest-daily
notebook-controller-deployment-6f789d748-5wbcv gcr.io/kubeflow-images-public/notebook-controller:vmaster-gf39279c0
profiles-deployment-6fffd9c9-fwbt8 gcr.io/kubeflow-images-public/profile-controller:vmaster-g34aa47c2,gcr.io/kubeflow-images-public/kfam:v1.1.0-g9f3bfd00
pytorch-operator-d449c769b-hqm55 gcr.io/kubeflow-images-public/pytorch-operator:vmaster-gd596e904
seldon-controller-manager-68f9f7bff6-jkb57 docker.io/seldonio/seldon-core-operator:1.2.1
spark-operatorsparkoperator-758795c89b-vbrhf gcr.io/spark-operator/spark-operator:v1beta2-1.1.0-2.4.5
spartakus-volunteer-69f5b89c96-njknm gcr.io/google_containers/spartakus-amd64:v1.1.0
tf-job-operator-644f847f5c-2844p gcr.io/kubeflow-images-public/tf_operator:vmaster-ga2ae7bff
workflow-controller-dd8985f4d-qxh8m argoproj/workflow-controller:v2.3.0

Istio is at 1.3, which is pretty disappointing.

$ kubectl get pods -n istio-system -o=custom-columns='NAME:.metadata.name,DATA:spec.containers[*].image'
NAME DATA
cluster-local-gateway-f4967d447-57txx docker.io/istio/proxyv2:1.3.1
istio-citadel-79b5b568b-g6lnc gcr.io/istio-release/citadel:release-1.3-latest-daily
istio-galley-756f5f45c4-lhlsf gcr.io/istio-release/galley:release-1.3-latest-daily
istio-ingressgateway-77f74c944c-b2xxt gcr.io/istio-release/proxyv2:release-1.3-latest-daily
istio-nodeagent-f4bkx gcr.io/istio-release/node-agent-k8s:release-1.3-latest-daily
istio-pilot-55f7f6f6df-jdxcg gcr.io/istio-release/pilot:release-1.3-latest-daily,gcr.io/istio-release/proxyv2:release-1.3-latest-daily
istio-policy-76dbd68445-kcftf gcr.io/istio-release/mixer:release-1.3-latest-daily,gcr.io/istio-release/proxyv2:release-1.3-latest-daily
istio-security-post-install-release-1.3-latest-daily-wkzlz gcr.io/istio-release/kubectl:release-1.3-latest-daily
istio-sidecar-injector-5d9f474dcb-8v2vs gcr.io/istio-release/sidecar_injector:release-1.3-latest-daily
istio-telemetry-697c8fd794-d66xt gcr.io/istio-release/mixer:release-1.3-latest-daily,gcr.io/istio-release/proxyv2:release-1.3-latest-daily
prometheus-b845cc6fc-zcdqb docker.io/prom/prometheus:v2.8.0

Tetsuya Isogai

Working at Microsoft/Cloud Solution Architect/Azure Core Infra