References
- ec2-image-builder-workshop
- Troubleshoot EC2 Image Builder
Understanding Image Builder
Image Builder uses cinc-client to apply a uniform client-side configuration. CINC stands for "CINC Is Not Chef"; it is a free distribution of Chef.
https://cinc.sh/about/
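The overall logic of an Image Builder pipeline is as follows.
The core concepts relate to one another as shown in the figure below:
- recipe: contains one parent image and one or more components
- component: the building block of a recipe; it describes how to build, validate, and test the image
- infrastructure: defines the environment in which the image is built and tested
- distribution: specifies which AWS Regions, accounts, or organizations the image is distributed to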
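To make the relationship between these concepts concrete, here is a minimal sketch that prints (but does not run) the AWS CLI call that ties a recipe, an infrastructure configuration, and a distribution configuration into one pipeline. The function name and all argument values are hypothetical placeholders; the components are not listed because they are referenced from the recipe itself.

```shell
# Hypothetical helper: print the CLI call that wires the pieces into a pipeline.
# $1: pipeline name, $2: recipe ARN, $3: infrastructure configuration ARN,
# $4: distribution configuration ARN. All values are placeholders.
print_pipeline_command() {
  # Printing instead of executing keeps the sketch safe to run without credentials.
  printf 'aws imagebuilder create-image-pipeline --name %s --image-recipe-arn %s --infrastructure-configuration-arn %s --distribution-configuration-arn %s\n' \
    "$1" "$2" "$3" "$4"
}

print_pipeline_command demo-pipeline RECIPE_ARN INFRA_ARN DIST_ARN
```

Printing the command first makes it easy to review which configuration each pipeline points at before anything is created.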
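For details on the commands that are run and the logs they produce, see Under the Hood.
Building a custom pcluster AMI
Using the official pcluster AMI as the source
The earlier pcluster article covered creating an AMI with the pcluster tool, which in fact uses Image Builder under the hood.
Image Builder uses SSM Automation to orchestrate image build operations. To see additional details for troubleshooting a build failure, search the console for the execution ID that Image Builder provides, then inspect that Automation execution.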
Resource handler returned message: "Error occurred during operation 'SSM execution 'a13bc224-150b-47ae-8e9d-47f3bdc4dc48' failed for image arn: 'arn:aws-cn:imagebuilder:cn-north-1:xxxxxxx:image/parallelclusterimage-myubuntu1804/3.1.4/1' with status = 'Failed' in state = 'BUILDING' and failure message = 'Document arn:aws-cn:imagebuilder:cn-north-1:xxxxxxx:component/parallelclusterimage-de178710-9674-11ed-b264-0e2b2c28fce2/3.1.4/1 failed!''." (RequestToken: 273970de-d749-1216-1215-06466707ae47, HandlerErrorCode: GeneralServiceException)
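The execution ID is buried in the middle of that message. As a minimal sketch, assuming the message always contains the text `SSM execution '<id>'` as it does above, the ID can be pulled out with sed; the function name is hypothetical.

```shell
# Hypothetical helper: print the SSM execution ID found in an Image Builder
# error message, or nothing if the message has no such ID.
extract_ssm_execution_id() {
  printf '%s\n' "$1" |
    sed -n "s/.*SSM execution '\([0-9a-f-]\{36\}\)'.*/\1/p"
}

msg="Error occurred during operation 'SSM execution 'a13bc224-150b-47ae-8e9d-47f3bdc4dc48' failed for image arn"
extract_ssm_execution_id "$msg"
```

The resulting ID can then be passed to `aws ssm get-automation-execution --automation-execution-id <id>` instead of searching for it in the console.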
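The detailed error matches what CloudFormation reported; to dig deeper, check the error log of the corresponding document.
The document's CloudWatch Logs show the error from building the custom AMI (these logs come from Image Builder).
The cause is a version mismatch: the pcluster CLI is version 3.1.4, while the AMI was built for pcluster 3.2.1.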
================================================================================
Stdout: Recipe Compile Error in /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster/attributes/conditions.rb
Stdout: ================================================================================
Stdout:
Stdout: RuntimeError
Stdout: ------------
Stdout: This AMI was created with aws-parallelcluster-cookbook-3.2.1, but is trying to be used with aws-parallelcluster-cookbook-3.1.4. Please either use an AMI created with aws-parallelcluster-cookbook-3.1.4 or change your ParallelCluster to aws-parallelcluster-cookbook-3.2.1
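A mismatch like this can be caught before starting a build. The sketch below assumes that official AMI names embed the version as `aws-parallelcluster-<version>-...`, which holds for the AMI name used in this post but is not a documented guarantee; both function names are hypothetical.

```shell
# Hypothetical helper: print the version embedded in an official AMI name,
# or nothing if the name does not follow the assumed pattern.
ami_pcluster_version() {
  printf '%s\n' "$1" |
    sed -n 's/^aws-parallelcluster-\([0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\)-.*/\1/p'
}

# $1: pcluster CLI version, $2: AMI name.
# Returns 0 on match, 1 on mismatch, 2 if the AMI name carries no version.
versions_match() {
  _ami_version=$(ami_pcluster_version "$2")
  [ -n "$_ami_version" ] || return 2
  [ "$1" = "$_ami_version" ]
}

if versions_match 3.1.4 "aws-parallelcluster-3.2.1-ubuntu-1804-lts-hvm-x86_64-202209270835"; then
  echo "versions match"
else
  echo "version mismatch or unrecognized AMI name"
fi
```

The first argument would be the installed CLI version, for example as reported by `pcluster version`.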
Once the versions match, the build succeeds, and the custom AMI can then be used to create a cluster:
Region: cn-north-1
Image:
  Os: ubuntu1804
  CustomAmi: ami-003819348308f4f4f
HeadNode:
  InstanceType: m5.large
  ...
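Using a public AMI as the source
The previous build started from the official pcluster AMI, aws-parallelcluster-3.2.1-ubuntu-1804-lts-hvm-x86_64-202209270835. The next experiment checks whether a plain Ubuntu AMI also builds successfully.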
Build:
  InstanceType: c5.4xlarge
  ParentImage: ami-07356f2da3fd22521
  SubnetId: subnet-xxxxxxxxx
  SecurityGroupIds:
    - sg-xxxxxxxx
  UpdateOsPackages:
    Enabled: true
The CloudFormation stack reports the following error:
Resource handler returned message: "Error occurred during operation 'SSM execution 'cb055f7d-7c07-471a-9d3a-06a900926f8e' failed for image arn: 'arn:aws-cn:imagebuilder:cn-north-1:xxxxxxx:image/parallelclusterimage-myubuntu1804raw/3.2.1/1' with status = 'Failed' in state = 'BUILDING' and failure message = 'Document arn:aws-cn:imagebuilder:cn-north-1:xxxxxxx:component/parallelclusterimage-f78ad100-9685-11ed-89e5-06b4c2e890aa/3.2.1/1 failed!''." (RequestToken: ea6df8f2-d076-43b7-8893-44c567a70a34, HandlerErrorCode: GeneralServiceException)
The same procedure as before leads to the cause:
Command 9647e5df-dfe4-49f5-aab2-f6843bf55c16 returns unexpected invocation result:
{Status=[Failed], ResponseCode=[1], Output=[{
"executionId": "c0466b39-9686-11ed-8042-0651be0b5200",
"status": "failed",
"failedStepCount": 1,
"executedStepCount": 24,
"ignoredFailedStepCount": 0,
"failureMessage": "Document arn:aws-cn:imagebuilder:cn-north-1:xxxxxxx:component/parallelclusterimage-f78ad100-9685-11ed-89e5-06b4c2e890aa/3.2.1/1 failed!",
"logUrl": "/var/lib/amazon/toe/TOE_2023-01-17_16-48-21_UTC-0_c0466b39-9686-11ed-8042-0651be0b5200"
}
The CloudWatch Logs reveal a somewhat awkward cause: git cannot reach github.com from the build instance.
STDERR: fatal: unable to access 'https://github.com/pyenv/pyenv-virtualenv/': gnutls_handshake() failed: The TLS connection was non-properly terminated.
Ran git ls-remote "https://github.com/pyenv/pyenv-virtualenv" "master*" returned 128
I could not find anywhere to configure a proxy, so I gave up on this approach for now.
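When a build log is long, it helps to list every remote that git failed to reach. The following is a minimal sketch, assuming the failures appear in the `fatal: unable to access '<url>'` form shown above; the function name is hypothetical.

```shell
# Hypothetical helper: read log text on stdin and print each URL that git
# reported as inaccessible, one per line, without duplicates.
unreachable_git_urls() {
  sed -n "s/.*fatal: unable to access '\([^']*\)'.*/\1/p" | sort -u
}

printf '%s\n' \
  "STDERR: fatal: unable to access 'https://github.com/pyenv/pyenv-virtualenv/': gnutls_handshake() failed" |
  unreachable_git_urls
```

The resulting list shows which hosts would need to be reachable from the build subnet, or mirrored, before retrying.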
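Analyzing the error through the user data
Below is the user data of a pcluster head node launched after a successful build, trimmed to its main logic:
- Check that the cookbook and pcluster versions match
- Check that the AMI is supported by pcluster
- Run chef to configure the node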
#!/bin/bash -x
...
function vendor_cookbook
{
  mkdir /tmp/cookbooks
  cd /tmp/cookbooks
  tar -xzf /etc/chef/aws-parallelcluster-cookbook.tgz
  HOME_BAK="${HOME}"
  export HOME="/tmp"
  for d in `ls /tmp/cookbooks`; do
    cd /tmp/cookbooks/$d
    LANG=en_US.UTF-8 /opt/cinc/embedded/bin/berks vendor /etc/chef/cookbooks --delete || error_exit 'Vendoring cookbook failed.'
  done;
  export HOME="${HOME_BAK}"
}
...
custom_cookbook=NONE
export _region=cn-north-1
s3_url=amazonaws.com.cn
if [ "${custom_cookbook}" != "NONE" ]; then
  if [[ "${custom_cookbook}" =~ ^s3://([^/]*)(.*) ]]; then
    bucket_region=$(aws s3api get-bucket-location --bucket ${BASH_REMATCH[1]} | jq -r '.LocationConstraint')
    if [[ "${bucket_region}" == null ]]; then
      bucket_region="us-east-1"
    fi
    cookbook_url=$(aws s3 presign "${custom_cookbook}" --region "${bucket_region}")
  else
    cookbook_url=${custom_cookbook}
  fi
fi
export parallelcluster_version=aws-parallelcluster-3.2.1
export cookbook_version=aws-parallelcluster-cookbook-3.2.1
export chef_version=17.2.29
export berkshelf_version=7.2.0
if [ -f /opt/parallelcluster/.bootstrapped ]; then
  installed_version=$(cat /opt/parallelcluster/.bootstrapped)
  if [ "${cookbook_version}" != "${installed_version}" ]; then
    error_exit "This AMI was created with ${installed_version}, but is trying to be used with ${cookbook_version}. Please either use an AMI created with ${cookbook_version} or change your ParallelCluster to ${installed_version}"
  fi
else
  error_exit "This AMI was not baked by ParallelCluster. Please use pcluster build-image command to create an AMI by providing your AMI as parent image."
fi
if [ "${custom_cookbook}" != "NONE" ]; then
  curl --retry 3 -v -L -o /etc/chef/aws-parallelcluster-cookbook.tgz ${cookbook_url}
  vendor_cookbook
fi
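This shows that the error seen while building the custom AMI actually came from the version-mismatch check performed during the image test phase.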
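The same check can be run by hand on an instance launched from a candidate AMI. This is a minimal sketch that mirrors the marker-file logic in the user data above; the function name, its messages, and its return codes are hypothetical, and the default path is the one used in the script.

```shell
# Hypothetical helper mirroring the user data check.
# $1: expected cookbook version string
# $2: marker file (defaults to the path used in the user data)
# Returns 0 on match, 1 on mismatch, 2 if the marker file is missing.
check_bootstrapped() {
  _marker="${2:-/opt/parallelcluster/.bootstrapped}"
  if [ ! -f "$_marker" ]; then
    echo "not baked by ParallelCluster: $_marker missing"
    return 2
  fi
  _installed=$(cat "$_marker")
  if [ "$_installed" != "$1" ]; then
    echo "mismatch: AMI has $_installed, expected $1"
    return 1
  fi
  echo "ok: $_installed"
}

check_bootstrapped "aws-parallelcluster-cookbook-3.2.1" || true
```

Taking the marker path as an optional argument lets the check be tried against a copy of the file rather than only on a live node.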
The /etc/chef/cookbooks directory holds the cookbooks, each of which is a collection of recipes:
$ tree -L 1
/etc/chef/cookbooks
├── apt
├── aws-parallelcluster
├── aws-parallelcluster-awsbatch
├── aws-parallelcluster-config
├── aws-parallelcluster-install
├── aws-parallelcluster-scheduler-plugin
├── aws-parallelcluster-slurm
├── aws-parallelcluster-test
├── iptables
├── line
├── nfs
├── openssh
├── pyenv
├── selinux
├── yum
└── yum-epel
Pinpointing the exact error requires reading the Ruby code inside these cookbooks.