Deploying a FATE Cluster on Kubernetes with KubeFATE
source link: https://www.hollischuang.com/archives/6638
The open-source federated learning framework FATE can be deployed in several ways: via Docker-Compose, standalone, as a native cluster, or with KubeFATE.
If you just want a quick taste of FATE, or your models and data fit on a single machine, the Docker-Compose deployment is the simplest option.
If you only develop algorithms, your development machine is not very powerful, and you also want to test the underlying eggroll modules, standalone is the most convenient choice.
If your datasets and models have outgrown a single machine, you need to scale out, and you have data that requires maintaining a long-lived FATE cluster, consider the KubeFATE-on-Kubernetes deployment.
The last option, a native cluster deployment, is normally only used for special reasons, such as when Kubernetes cannot be deployed internally or when you need to customize FATE's deployment yourself.
This article records my attempt at the KubeFATE-on-Kubernetes deployment. I hit plenty of pitfalls along the way: the official documentation covers a MiniKube test environment, so many of the problems I ran into are simply not mentioned there. The process was long and painful, but the deployment eventually succeeded.
The deployment steps, and the problems encountered, are as follows.
All commands in this walkthrough are executed on the Master machine; nothing needs to be run on the Nodes.
Prerequisite: two K8S clusters already exist, both with an ingress-controller deployed, and they can reach each other over the network. (For K8S cluster setup and Ingress installation, see: Deploying a K8S Cluster on CentOS.)
Before deploying, check the machines in the K8S cluster:
[root@k8s-master1 ~]# kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8s-master1 Ready control-plane,master 20h v1.22.3 172.29.247.176 <none> CentOS Linux 7 (Core) 3.10.0-693.2.2.el7.x86_64 docker://18.9.1
k8s-node1 Ready <none> 20h v1.22.3 172.29.247.175 <none> CentOS Linux 7 (Core) 3.10.0-693.2.2.el7.x86_64 docker://18.9.1
k8s-node2 Ready <none> 20h v1.22.3 172.29.247.177 <none> CentOS Linux 7 (Core) 3.10.0-693.2.2.el7.x86_64 docker://18.9.1
Deploy KubeFATE
Download the KubeFATE tarball:
wget https://github.com/FederatedAI/KubeFATE/releases/download/v1.5.1/kubefate-k8s-v1.5.1.tar.gz
I am using v1.5.1 here; replace the link above with whatever version you need:
https://github.com/FederatedAI/KubeFATE/releases/download/{version}/kubefate-k8s-{version}.tar.gz
tar -zxvf kubefate-k8s-v1.5.1.tar.gz
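Since the release URL follows a fixed pattern, the version can be kept in a single variable; a small sketch (pattern taken from the link above):

```shell
# Keep the release version in one place; the tarball name and URL
# follow the pattern used by the KubeFATE releases page.
KUBEFATE_VERSION=v1.5.1
TARBALL="kubefate-k8s-${KUBEFATE_VERSION}.tar.gz"
URL="https://github.com/FederatedAI/KubeFATE/releases/download/${KUBEFATE_VERSION}/${TARBALL}"
echo "$URL"
# then: wget "$URL" && tar -zxvf "$TARBALL"
```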
After extraction, the archive mainly contains the following files:
cluster-serving.yaml cluster-spark.yaml cluster.yaml config.yaml examples kubefate kubefate.yaml rbac-config.yaml
Deploy rbac-config.yaml:
kubectl apply -f ./rbac-config.yaml
namespace/kube-fate created
serviceaccount/kubefate-admin created
clusterrolebinding.rbac.authorization.k8s.io/kubefate created
Then deploy the KubeFATE service itself with the kubefate.yaml from the same archive (this is what creates the kubefate and mariadb pods shown below):
kubectl apply -f ./kubefate.yaml
Check the deployment result:
[root@k8s-master1 ~]# kubectl get pod,ingress -n kube-fate
NAME READY STATUS RESTARTS AGE
pod/kubefate-857dd6fcb5-srprb 1/1 Running 0 44s
pod/mariadb-7c8848bd55-w85dw 1/1 Running 0 44s
Configure the ingress host
Find the name of the node where the ingress pod is running:
[root@k8s-master1 ~]# kubectl get pod -A -o wide | grep ingress
ingress-nginx nginx-ingress-controller-7d4544b644-j65jb 1/1 Running 0 17m 172.29.247.177 k8s-node2 <none> <none>
As you can see, node2 is the node running the ingress.
Now bind node2's host to kubefate.net:
echo "192.168.1.1 kubefate.net" >> /etc/hosts
Replace 192.168.1.1 here with your own IP.
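Rather than copying the IP by eye, the node IP column can be pulled out of the `kubectl get pod -A -o wide` output with awk. A sketch, run here against the sample line captured above (in practice, pipe the live kubectl command in):

```shell
# The 7th whitespace-separated column of `kubectl get pod -A -o wide`
# is the pod IP, which for a hostNetwork ingress pod is the node's IP.
sample='ingress-nginx nginx-ingress-controller-7d4544b644-j65jb 1/1 Running 0 17m 172.29.247.177 k8s-node2 <none> <none>'
NODE_IP=$(echo "$sample" | awk '{print $7}')
echo "$NODE_IP kubefate.net"    # this is the line to append to /etc/hosts
```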
Install the kubefate command-line tool:
chmod +x ./kubefate && sudo mv ./kubefate /usr/local/bin/kubefate
Check connectivity:
[root@k8s-master1 ~]# kubefate version
* kubefate commandLine version=v1.3.0
* kubefate service version=v1.3.0
If the line kubefate service version=v1.3.0 appears, the connection works.
In my case, however, the actual run failed:
[root@k8s-master1 ~]# kubefate version
kubefate: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by kubefate)
This happens because the kubefate binary depends on a glibc version the system does not have. It is a known bug, fixed in v1.6.1; see: https://github.com/FederatedAI/KubeFATE/issues/372
But since the external party I work with uses 1.5.1, I could not upgrade to 1.6.1, so I had to work around the problem instead.
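CentOS 7 ships glibc 2.17, which is why the binary complains. Before building from source it is worth confirming the gap; a small version-comparison helper (a sketch, assuming GNU `sort -V` is available):

```shell
# version_ge A B -> succeeds when version A >= version B,
# using GNU sort's version ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# CentOS 7's stock glibc is 2.17; the kubefate binary wants 2.28.
if version_ge "2.17" "2.28"; then
  echo "glibc is new enough"
else
  echo "need glibc >= 2.28, building it below"
fi
```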
Fixing the missing GLIBC_2.28
Install GLIBC_2.28:
curl -O http://ftp.gnu.org/gnu/glibc/glibc-2.28.tar.gz
tar zxf glibc-2.28.tar.gz
cd glibc-2.28/
mkdir build
cd build/
../configure --prefix=/usr/local/glibc-2.28
configure: error:
*** These critical programs are missing or too old: make bison compiler
*** Check the INSTALL file for required versions.
The error says make, bison, and the compiler are missing or too old; install them one by one.
bison is not installed on this machine:
[root@k8s-master1 ~]# bison --version
-bash: bison: command not found
Install it:
yum install bison
Install make:
wget http://ftp.gnu.org/gnu/make/make-4.2.tar.gz
tar -xzvf make-4.2.tar.gz
cd make-4.2
sudo ./configure
sudo make
sudo make install
sudo rm -rf /usr/bin/make
sudo cp ./make /usr/bin/
make -v
Install the compiler:
yum -y install centos-release-scl
yum -y install devtoolset-8-gcc devtoolset-8-gcc-c++ devtoolset-8-binutils
scl enable devtoolset-8 bash
echo "source /opt/rh/devtoolset-8/enable" >>/etc/profile
Then rerun the GLIBC installation. (Note that --prefix=/usr installs glibc 2.28 over the system copy; replacing the system glibc in place is risky, so be careful on machines you cannot afford to break.)
cd glibc-2.28/
mkdir build
cd build/
sudo ../configure --prefix=/usr --disable-profile --enable-add-ons --with-headers=/usr/include --with-binutils=/usr/bin
make
make install
Check the installation result:
strings /lib64/libc.so.6 |grep GLIBC_2.28
GLIBC_2.28
GLIBC_2.28
Now kubefate version works.
Fixing connection refused
If you hit this problem:
[root@k8s-master1 build]# kubefate version
* kubefate commandLine version=v1.3.0
* kubefate service connection error, Post "http://localhost:8080/v1/user/login": dial tcp 127.0.0.1:8080: connect: connection refused
this is because dnsPolicy: ClusterFirstWithHostNet was not added to the ingress-controller's configuration; see: ingress installation.
If you hit this problem:
[root@k8s-master1 ~]# kubefate version
* kubefate commandLine version=v1.3.0
* kubefate service connection error, Post "http://kubefate.net/v1/user/login": dial tcp 192.168.1.1:80: i/o timeout
then port 80 is probably not open; open it.
Deploy FATE
Create namespaces
kubectl create namespace fate-10000
Pick a name of your own instead of fate-10000. If you have two clusters, this step needs to create two namespaces with different names:
kubectl create namespace fate-10000
kubectl create namespace fate-9999
Edit the cluster.yaml file
The extracted KubeFATE folder contains a cluster.yaml file; open it and edit its contents as follows:
name: fate-10000                 # your cluster name (by convention fate-<partyId>)
namespace: fate-10000            # your namespace, matching the one just created
chartName: fate
chartVersion: v1.5.1
partyId: 10000
registry: ""
imageTag: ""
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: false
istio:
  enabled: false
modules:
  - rollsite
  - clustermanager
  - nodemanager
  - mysql
  - python
  - fateboard
  - client

rollsite:
  type: NodePort
  nodePort: 30100                # your port
  partyList:
  - partyId: 9999                # the peer party's ID for federated learning
    partyIp: 192.168.1.1         # the peer party's IP
    partyPort: 30101             # the peer party's port

nodemanager:
  count: 3
  sessionProcessorsPerNode: 4
  list:
  - name: nodemanager
    nodeSelector:
    sessionProcessorsPerNode: 4
    subPath: "nodemanager"
    existingClaim: ""
    storageClass: "nodemanager"
    accessMode: ReadWriteOnce
    size: 1Gi

python:
  type: NodePort
  httpNodePort: 30097
  grpcNodePort: 30092

mysql:
  nodeSelector:
  ip: mysql
  port: 3306
  database: eggroll_meta
  user: fate
  password: fate_dev
  subPath: ""
  existingClaim: ""
  storageClass: "mysql"
  accessMode: ReadWriteOnce
  size: 1Gi
If two clusters connect directly, also create a second file, cluster2.yaml:
name: fate-9999                  # your cluster name (by convention fate-<partyId>)
namespace: fate-9999             # your namespace, matching the one just created
chartName: fate
chartVersion: v1.5.1
partyId: 9999
registry: ""
imageTag: ""
pullPolicy:
imagePullSecrets:
- name: myregistrykey
persistence: false
istio:
  enabled: false
modules:
  - rollsite
  - clustermanager
  - nodemanager
  - mysql
  - python
  - fateboard
  - client

rollsite:
  type: NodePort
  nodePort: 30101                # your port
  partyList:
  - partyId: 10000               # the peer party's ID for federated learning
    partyIp: 192.168.1.2         # the peer party's IP
    partyPort: 30100             # the peer party's port

nodemanager:
  count: 3
  sessionProcessorsPerNode: 4
  list:
  - name: nodemanager
    nodeSelector:
    sessionProcessorsPerNode: 4
    subPath: "nodemanager"
    existingClaim: ""
    storageClass: "nodemanager"
    accessMode: ReadWriteOnce
    size: 1Gi

python:
  type: NodePort
  httpNodePort: 30097
  grpcNodePort: 30092

mysql:
  nodeSelector:
  ip: mysql
  port: 3306
  database: eggroll_meta
  user: fate
  password: fate_dev
  subPath: ""
  existingClaim: ""
  storageClass: "mysql"
  accessMode: ReadWriteOnce
  size: 1Gi
Run the FATE deployment:
[root@k8s-master1 ~]# kubefate cluster install -f cluster.yaml
create job success, job id=d8caf016-2269-4d0d-bb3f-5c8ad9824350
[root@k8s-master1 ~]# kubefate cluster install -f cluster2.yaml
create job success, job id=d8caf016-2269-4d0d-bb3f-5c8ad9824350
Check the job status:
[root@k8s-master1 ~]# kubefate job list
UUID CREATOR METHOD STATUS STARTTIME CLUSTERID AGE
d8caf016-2269-4d0d-bb3f-5c8ad9824350 admin ClusterInstall Failed 2021-11-09 03:58:11 558ce45f-27c5-474e-951a-093637d0e484 3s
If a job fails, inspect the error message:
[root@k8s-master1 ~]# kubefate job describe d8caf016-2269-4d0d-bb3f-5c8ad9824350
UUID d8caf016-2269-4d0d-bb3f-5c8ad9824350
StartTime 2021-11-09 03:58:11
EndTime 2021-11-09 03:58:15
Duration 3s
Status Failed
Creator admin
ClusterId 558ce45f-27c5-474e-951a-093637d0e484
Result ConfigMap "nodemanager-0-config" is invalid: metadata.labels: Invalid value: "1.6881e+06": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue', or 'my_value', or '12345', regex used for validation is
'(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')
After correcting the configuration file per the error message and rerunning, a healthy install reports Running:
[root@k8s-master1 ~]# kubefate job list
UUID CREATOR METHOD STATUS STARTTIME CLUSTERID AGE
4e587c68-8ad4-4546-bd1e-3d7a36b7d694 admin ClusterInstall Running 2021-11-09 04:54:26 ce23b18b-cea8-462c-95e7-76192ef08ea7 9s
Now look at the job's details:
kubefate job describe 4e587c68-8ad4-4546-bd1e-3d7a36b7d694
UUID 4e587c68-8ad4-4546-bd1e-3d7a36b7d694
StartTime 2021-11-09 04:54:26
EndTime 0001-01-01 00:00:00
Duration 41s
Status Running
Creator admin
ClusterId ce23b18b-cea8-462c-95e7-76192ef08ea7
Result Cluster install success
SubJobs nodemanager-2-6dd74c79cb-7p5gj PodStatus: Pending, SubJobStatus: Pending, Duration: 41s, StartTime: 2021-11-09 04:54:26, EndTime: 0001-01-01 00:00:00
nodemanager-7cfb965848-vnr2p PodStatus: Pending, SubJobStatus: Pending, Duration: 41s, StartTime: 2021-11-09 04:54:26, EndTime: 0001-01-01 00:00:00
python-56bc7865-xxxqb PodStatus: Pending, SubJobStatus: Pending, Duration: 41s, StartTime: 2021-11-09 04:54:26, EndTime: 0001-01-01 00:00:00
rollsite-56878dd8b9-hdgf9 PodStatus: Pending, SubJobStatus: Pending, Duration: 41s, StartTime: 2021-11-09 04:54:26, EndTime: 0001-01-01 00:00:00
clustermanager-694bbccdc9-tjxt8 PodStatus: Pending, SubJobStatus: Pending, Duration: 41s, StartTime: 2021-11-09 04:54:26, EndTime: 0001-01-01 00:00:00
mysql-867fb9c446-qnppk PodStatus: Pending, SubJobStatus: Pending, Duration: 41s, StartTime: 2021-11-09 04:54:26, EndTime: 0001-01-01 00:00:00
nodemanager-0-5c8d46c664-vk2ff PodStatus: Pending, SubJobStatus: Pending, Duration: 41s, StartTime: 2021-11-09 04:54:26, EndTime: 0001-01-01 00:00:00
nodemanager-1-56d4f9676-llx4k PodStatus: Pending, SubJobStatus: Pending, Duration: 41s, StartTime: 2021-11-09 04:54:26, EndTime: 0001-01-01 00:00:00
Once every sub-job's status turns to Success, the deployment has succeeded.
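Waiting for every sub-job to flip to Success can be scripted with a small poll loop. A sketch: the stub command `true` stands in for something like `kubefate job describe <uuid> | grep -q Success`:

```shell
# poll MAX_TRIES CMD... : rerun CMD until it succeeds or tries run out.
poll() {
  tries=$1; shift
  i=0
  while [ "$i" -lt "$tries" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# stub: replace `true` with the real status check against your job UUID
poll 30 true && echo "deployment finished"
```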
Test FATE
Run the following on any machine in the cluster:
kubectl exec -it svc/fateflow -c python -n fate-10000 -- bash
cd ../examples/toy_example/
python run_toy_example.py 10000 9999 1
On success it prints something like the following:
stdout:{
"data": {
"board_url": "http://fateboard:8080/index.html#/dashboard?job_id=202111090955384060332&role=guest&party_id=10000",
"job_dsl_path": "/data/projects/fate/jobs/202111090955384060332/job_dsl.json",
"job_id": "202111090955384060332",
"job_runtime_conf_on_party_path": "/data/projects/fate/jobs/202111090955384060332/guest/job_runtime_on_party_conf.json",
"job_runtime_conf_path": "/data/projects/fate/jobs/202111090955384060332/job_runtime_conf.json",
"logs_directory": "/data/projects/fate/logs/202111090955384060332",
"model_info": {
"model_id": "guest-10000#host-9999#model",
"model_version": "202111090955384060332"
},
"pipeline_dsl_path": "/data/projects/fate/jobs/202111090955384060332/pipeline_dsl.json",
"train_runtime_conf_path": "/data/projects/fate/jobs/202111090955384060332/train_runtime_conf.json"
},
"jobId": "202111090955384060332",
"retcode": 0,
"retmsg": "success"
}
job status is running
job status is running
job status is running
job status is running
job status is running
job status is running
job status is running
job status is running
job status is running
[INFO] [2021-11-09 09:55:41,741] [845:139855671260992] - secure_add_guest.py[line:99]: begin to init parameters of secure add example guest
[INFO] [2021-11-09 09:55:41,742] [845:139855671260992] - secure_add_guest.py[line:102]: begin to make guest data
[INFO] [2021-11-09 09:55:42,494] [845:139855671260992] - secure_add_guest.py[line:105]: split data into two random parts
[INFO] [2021-11-09 09:55:44,404] [845:139855671260992] - secure_add_guest.py[line:108]: share one random part data to host
[INFO] [2021-11-09 09:55:44,411] [845:139855671260992] - secure_add_guest.py[line:111]: get share of one random part data from host
[INFO] [2021-11-09 09:55:45,184] [845:139855671260992] - secure_add_guest.py[line:114]: begin to get sum of guest and host
[INFO] [2021-11-09 09:55:46,094] [845:139855671260992] - secure_add_guest.py[line:117]: receive host sum from guest
[INFO] [2021-11-09 09:55:46,122] [845:139855671260992] - secure_add_guest.py[line:124]: success to calculate secure_sum, it is 1999.9999999999993
...
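The jobId in the JSON reply above is what you would feed to FATE Board or further flow queries. A hedged sketch of extracting it (reply abbreviated here; assumes python3 is on the machine, since jq may not be installed):

```shell
# fate_flow replies with JSON; pull jobId out without extra tooling.
reply='{"jobId":"202111090955384060332","retcode":0,"retmsg":"success"}'
JOB_ID=$(printf '%s' "$reply" | python3 -c 'import sys, json; print(json.load(sys.stdin)["jobId"])')
echo "$JOB_ID"
```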
You can also check FATE's deployment status on each machine:
[root@k8s-node1 ~]# kubectl get pods --namespace=fate-10000
NAME READY STATUS RESTARTS AGE
clustermanager-694bbccdc9-tjxt8 1/1 Running 0 5h4m
mysql-867fb9c446-qnppk 1/1 Running 0 5h4m
nodemanager-0-5c8d46c664-vk2ff 2/2 Running 0 5h4m
nodemanager-1-56d4f9676-llx4k 2/2 Running 0 5h4m
nodemanager-2-6dd74c79cb-7p5gj 2/2 Running 0 5h4m
nodemanager-7cfb965848-vnr2p 2/2 Running 0 5h4m
python-56bc7865-xxxqb 3/3 Running 1 5h4m
rollsite-56878dd8b9-hdgf9 1/1 Running 0 5h4m
Add a Party
To add a new party to an existing federated learning deployment, first edit the cluster.yaml file and add the new party's information:
rollsite:
  type: NodePort
  nodePort: 10000
  # exchange:
  #   ip: 192.168.0.1
  #   port: 30000
  partyList:
  - partyId: 9999
    partyIp: 192.168.1.2
    partyPort: 9370
  - partyId: 30100
    partyIp: 192.168.1.3
    partyPort: 30100
  # nodeSelector:
# nodeSelector:
Then run the cluster update command:
kubefate cluster update -f cluster.yaml
This also creates a few jobs; once they finish successfully, the new party is in.
Of course, the other side has to configure this party's information as well before the two can communicate.