Using Claude 3.5 Sonnet and Stable Diffusion XL: How to Keep Refining Image Generation on Amazon Bedrock Until the Requirements Are Met

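Among the AI models available on Amazon Bedrock, the Anthropic Claude 3 family added image recognition capabilities. The latest model in particular, Anthropic Claude 3.5 Sonnet, offers significantly improved image recognition. I have run some simple trials and comparisons to see how these Claude models perform at OCR (optical character recognition).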

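This article shows how to use the image recognition capability of Anthropic Claude 3.5 Sonnet on Amazon Bedrock to validate, and when necessary regenerate, images produced by Stability AI's Stable Diffusion XL (SDXL). The goal is to reduce the manual effort of visually checking images by automatically judging whether each generated image meets the requirements.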

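Notes:

The source code provided in this and other articles was created as part of independent research. Use it at your own risk, and be aware that it may be changed without notice.
This article was written using AWS services on a personally registered AWS account.
The Amazon Bedrock models used in this article are the versions that were run on July 20, 2024 (JST), and they were used under the end user license agreements (EULAs) in effect at that time.
- Anthropic Claude 3.5 Sonnet (anthropic.claude-3-5-sonnet-20240620-v1:0): Anthropic on Bedrock - Commercial Terms of Service (effective date: January 2, 2024)
- Stability AI Stable Diffusion XL (stability.stable-diffusion-xl-v1): STABILITY AMAZON BEDROCK END USER LICENSE AGREEMENT (last updated: April 29, 2024)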

Architecture Diagram and Processing Flow

The architecture used for this article is shown in the diagram below.

Using Amazon Bedrock to repeatedly generate images with Stable Diffusion XL via Claude 3.5 Sonnet until requirements are met

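The processing flow is as follows:

1. An event containing the prompt and parameters is received as input.

2-1. The input prompt is used to run the SDXL model on Amazon Bedrock and generate an image.

2-2. The generated image is saved to Amazon S3.

2-3. The image saved to Amazon S3 is validated with the Claude 3.5 Sonnet model on Amazon Bedrock, which checks whether it meets the requirements of the image generation prompt.

If the image does not meet the prompt's requirements, steps 2-1 through 2-3 are repeated, up to the specified number of attempts for the same prompt.

If the image meets the prompt's requirements, it is returned as the output.

3. If the specified number of attempts with the same prompt has been used up without an image meeting the requirements, and the limit on prompt revisions has not been exceeded, the Claude 3.5 Sonnet model on Amazon Bedrock revises the image generation prompt to make it more likely to meet the requirements, and processing restarts from 2-1.

If the limit on prompt revisions has been exceeded, the processing is treated as an error and terminated.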
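
The flow above boils down to two nested loops. The following is a minimal sketch of that control flow only, with the model calls passed in as plain callables so it can be read without any AWS setup. The names `run_generation_loop`, `generate`, `validate`, and `revise` are hypothetical, and the limits here are counted as "attempts per prompt" and "number of revisions", which may differ by one from how the full implementation counts them.

```python
from typing import Callable, Optional, Tuple


def run_generation_loop(
    initial_prompt: str,
    generate: Callable[[str], bytes],
    validate: Callable[[bytes, str], bool],
    revise: Callable[[str], str],
    attempts_per_prompt: int,
    max_revisions: int,
) -> Optional[Tuple[bytes, str]]:
    """Return (image, prompt_used) for the first image that passes, else None."""
    prompt = initial_prompt
    for revision in range(max_revisions + 1):
        for _ in range(attempts_per_prompt):
            # Step 2-1: generate an image (step 2-2, saving to S3, is omitted here).
            image = generate(prompt)
            # Step 2-3: always validate against the original request.
            if validate(image, initial_prompt):
                return image, prompt
        # Step 3: revise the prompt unless the revision budget is used up.
        if revision < max_revisions:
            prompt = revise(initial_prompt)
    return None
```

Passing the model calls in as arguments is only a convenience for this sketch; it makes the loop easy to exercise with stub functions before wiring in Amazon Bedrock.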

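The key point in this flow is the revision of the image generation prompt by the Claude 3.5 Sonnet model.

If the image generation prompt is easy for the AI to interpret, an image that meets the requirements is likely to be produced within a few runs. If the prompt is hard for the AI to interpret, however, retrying alone may never produce an image that meets the requirements.

For that reason, once the specified number of attempts with the same prompt has been used up, the Claude 3.5 Sonnet model on Amazon Bedrock is used to optimize and revise the image generation prompt, raising the likelihood of generating an image that meets the requirements.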

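Implementation Example

Input event format

{
    "prompt": "<initial prompt used to generate the image>",
    "max_retry_attempts": <maximum number of image generation attempts per prompt>,
    "max_prompt_revisions": <maximum number of prompt revisions>,
    "output_s3_bucket_name": "<name of the S3 bucket where generated images are saved>",
    "output_s3_key_prefix": "<S3 key prefix for generated images>",
    "claude_validate_temperature": <temperature parameter of the Claude model for image validation (0.0 to 1.0)>,
    "claude_validate_top_p": <top-p parameter of the Claude model for image validation (0.0 to 1.0)>,
    "claude_validate_top_k": <top-k parameter of the Claude model for image validation>,
    "claude_validate_max_tokens": <maximum number of tokens the Claude model generates for image validation>,
    "claude_revise_temperature": <temperature parameter of the Claude model for prompt revision (0.0 to 1.0)>,
    "claude_revise_top_p": <top-p parameter of the Claude model for prompt revision (0.0 to 1.0)>,
    "claude_revise_top_k": <top-k parameter of the Claude model for prompt revision>,
    "claude_revise_max_tokens": <maximum number of tokens the Claude model generates for prompt revision>,
    "sdxl_cfg_scale": <CFG scale of the SDXL model>,
    "sdxl_steps": <number of steps of the SDXL model>,
    "sdxl_width": <width of the image generated by the SDXL model>,
    "sdxl_height": <height of the image generated by the SDXL model>,
    "sdxl_seed": <random seed of the SDXL model>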
}
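
Since the event carries many optional parameters, it can help to fill in defaults and reject obviously invalid values before any model is called. The sketch below is one possible way to do that; `normalize_event`, `DEFAULTS`, and `REQUIRED` are hypothetical names, the default values are taken from the sample implementation in this article, and only the ranges stated in the format above (0.0 to 1.0 for temperature and top-p) are checked.

```python
DEFAULTS = {
    "max_retry_attempts": 5,
    "max_prompt_revisions": 3,
    "output_s3_key_prefix": "generated-images",
    "claude_validate_temperature": 1.0,
    "claude_validate_top_p": 0.999,
    "claude_validate_top_k": 250,
    "claude_validate_max_tokens": 4096,
    "claude_revise_temperature": 1.0,
    "claude_revise_top_p": 0.999,
    "claude_revise_top_k": 250,
    "claude_revise_max_tokens": 4096,
    "sdxl_cfg_scale": 7,
    "sdxl_steps": 50,
    "sdxl_width": 1024,
    "sdxl_height": 1024,
    "sdxl_seed": 0,
}
REQUIRED = ("prompt", "output_s3_bucket_name")
# Keys whose values must lie in [0.0, 1.0] according to the event format.
UNIT_RANGE_KEYS = tuple(k for k in DEFAULTS if k.endswith(("_temperature", "_top_p")))


def normalize_event(event: dict) -> dict:
    """Return a copy of the event with defaults applied, or raise ValueError."""
    missing = [k for k in REQUIRED if not event.get(k)]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    merged = {**DEFAULTS, **event}
    for key in UNIT_RANGE_KEYS:
        if not 0.0 <= merged[key] <= 1.0:
            raise ValueError(f"{key} must be between 0.0 and 1.0, got {merged[key]}")
    for key in ("max_retry_attempts", "max_prompt_revisions"):
        if not isinstance(merged[key], int) or merged[key] < 1:
            raise ValueError(f"{key} must be a positive integer, got {merged[key]}")
    return merged
```

Failing fast on a malformed event avoids paying for image generations that would only end in an exception later in the run.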

Example input event

{
    "prompt": "A serene landscape with mountains and a lake",
    "max_retry_attempts": 5,
    "max_prompt_revisions": 3,
    "output_s3_bucket_name": "your-output-bucket-name",
    "output_s3_key_prefix": "generated-images",
    "claude_validate_temperature": 1.0,
    "claude_validate_top_p": 0.999,
    "claude_validate_top_k": 250,
    "claude_validate_max_tokens": 4096,
    "claude_revise_temperature": 1.0,
    "claude_revise_top_p": 0.999,
    "claude_revise_top_k": 250,
    "claude_revise_max_tokens": 4096,
    "sdxl_cfg_scale": 30,
    "sdxl_steps": 150,
    "sdxl_width": 1024,
    "sdxl_height": 1024,
    "sdxl_seed": 0
}

Source Code

The sample code for this implementation is as follows:

# #Event Sample
# {
#     "prompt": "A serene landscape with mountains and a lake",
#     "max_retry_attempts": 5,
#     "max_prompt_revisions": 3,
#     "output_s3_bucket_name": "your-output-bucket-name",
#     "output_s3_key_prefix": "generated-images",
#     "claude_validate_temperature": 1.0,
#     "claude_validate_top_p": 0.999,
#     "claude_validate_top_k": 250,
#     "claude_validate_max_tokens": 4096,
#     "claude_revise_temperature": 1.0,
#     "claude_revise_top_p": 0.999,
#     "claude_revise_top_k": 250,
#     "claude_revise_max_tokens": 4096,
#     "sdxl_cfg_scale": 30,
#     "sdxl_steps": 150,
#     "sdxl_width": 1024,
#     "sdxl_height": 1024,
#     "sdxl_seed": 0
# }
import boto3
import json
import base64
import os
import sys
from io import BytesIO
import datetime
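import random

# AWS_REGION is set automatically by the Lambda runtime.
region = os.environ.get('AWS_REGION')
bedrock_runtime_client = boto3.client('bedrock-runtime', region_name=region)
s3_client = boto3.client('s3', region_name=region)


def claude3_5_invoke_model(input_prompt, image_media_type=None, image_data_base64=None, model_params=None):
    """Send a text prompt (optionally with one base64 image) to Claude 3.5 Sonnet and return the reply text."""
    # Avoid a mutable default argument; fall back to an empty dict.
    model_params = model_params or {}
    messages = [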
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": input_prompt
                }
            ]
        }
    ]
    
    if image_media_type and image_data_base64:
        messages[0]["content"].insert(0, {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": image_media_type,
                "data": image_data_base64
            }
        })
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": model_params.get('max_tokens', 4096),
        "messages": messages,
        "temperature": model_params.get('temperature', 1.0),
        "top_p": model_params.get('top_p', 0.999),
        "top_k": model_params.get('top_k', 250),
        "stop_sequences": ["\n\nHuman:"]
    }
    response = bedrock_runtime_client.invoke_model(
        modelId='anthropic.claude-3-5-sonnet-20240620-v1:0',
        contentType='application/json',
        accept='application/json',
        body=json.dumps(body)
    )
    response_body = json.loads(response.get('body').read())
    response_text = response_body["content"][0]["text"]
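    return response_text


def sdxl_invoke_model(prompt, model_params=None):
    """Generate an image with SDXL on Amazon Bedrock and return the raw image bytes."""
    model_params = model_params or {}
    # The keys read here must match the sdxl_params dict built in lambda_handler
    # ('cfg_scale', 'steps', 'width', 'height', 'seed'); otherwise the values
    # passed in the event are silently ignored.
    seed = model_params.get('seed', 0)
    if seed == 0:
        # A seed of 0 means "choose a random seed" so that each attempt differs.
        seed = random.randint(0, 4294967295)
    
    body = {
        "text_prompts": [{"text": prompt}],
        "cfg_scale": model_params.get('cfg_scale', 10),
        "steps": model_params.get('steps', 50),
        "width": model_params.get('width', 1024),
        "height": model_params.get('height', 1024),
        "seed": seed
    }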
    
    print(f"SDXL model parameters: {body}")
    
    response = bedrock_runtime_client.invoke_model(
        body=json.dumps(body),
        modelId="stability.stable-diffusion-xl-v1",
        contentType="application/json",
        accept="application/json"
    )
    
    response_body = json.loads(response['body'].read())
    image_data = base64.b64decode(response_body['artifacts'][0]['base64'])
    print(f"Image generated successfully with seed: {seed}")
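    return image_data


def save_image_to_s3(image_data, bucket, key):
    """Upload the generated PNG to S3."""
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=image_data,
        ContentType='image/png'
    )
    print(f"Image saved to S3: s3://{bucket}/{key}")


def validate_image(image_data, prompt, claude_validate_params):
    """Ask Claude whether the image matches the prompt; return True only for a 'YES' answer."""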
    image_base64 = base64.b64encode(image_data).decode('utf-8')
    
    input_prompt = f"""Does this image match the following prompt? Prompt: {prompt}. 
    Please answer in the following JSON format:
    {{"result":"", "reason":""}}
    Ensure your response can be parsed as valid JSON. Do not include any explanations, comments, or additional text outside of the JSON structure."""
    validation_result = claude3_5_invoke_model(input_prompt, "image/png", image_base64, claude_validate_params)
    
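    try:
        print(f"validation Result: {validation_result}")
        parsed_result = json.loads(validation_result)
        # Normalize the answer so that "yes", "Yes " and "YES" are all accepted.
        is_valid = str(parsed_result['result']).strip().upper() == 'YES'
        print(f"Image validation result: {is_valid}")
        print(f"Validation reason: {parsed_result.get('reason', '')}")
        return is_valid
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        # Treat an unparsable or incomplete reply as a failed validation.
        print(f"Error parsing validation result: {validation_result}")
        return False


def revise_prompt(original_prompt, claude_revise_params):
    """Ask Claude to rewrite the prompt for Stable Diffusion; fall back to the original on failure."""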
    input_prompt = f"""Revise the following image generation prompt to optimize it for Stable Diffusion, incorporating best practices:
    {original_prompt}
    Please consider the following guidelines in your revision:
    1. Be specific and descriptive, using vivid adjectives and clear nouns.
    2. Include details about composition, lighting, style, and mood.
    3. Mention specific artists or art styles if relevant.
    4. Use keywords like "highly detailed", "4k", "8k", or "photorealistic" if appropriate.
    5. Separate different concepts with commas.
    6. Place more important elements at the beginning of the prompt.
    7. Use weights (e.g., (keyword:1.2)) for emphasizing certain elements if necessary.
    8. If the original prompt is not in English, translate it to English.
    Your goal is to create a clear, detailed prompt that will result in a high-quality image generation with Stable Diffusion.
    Please provide your response in the following JSON format:
    {{"revised_prompt":""}}
    Ensure your response can be parsed as valid JSON. Do not include any explanations, comments, or additional text outside of the JSON structure."""
    revised_prompt_json = claude3_5_invoke_model(input_prompt, model_params=claude_revise_params)
    print(f"Original prompt: {original_prompt}")
    print(f"Revised prompt JSON: {revised_prompt_json.strip()}")
    
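    try:
        parsed_result = json.loads(revised_prompt_json)
        revised_prompt = parsed_result['revised_prompt']
        print(f"Parsed revised prompt: {revised_prompt}")
        return revised_prompt
    except (json.JSONDecodeError, KeyError, TypeError):
        # Keep going with the original prompt if the reply is not the expected JSON.
        print(f"Error parsing revised prompt result: {revised_prompt_json}")
        return original_prompt


def lambda_handler(event, context):
    """Generate, validate, and if needed regenerate images until one passes or the limits are reached."""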
    try:
        initial_prompt = event['prompt']
        prompt = initial_prompt
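        # Both limits are reduced by one here because the loops below run range(n + 1):
        # a value of 5 therefore means 5 passes in total, not 5 passes after the first.
        max_retry_attempts = max(0, event.get('max_retry_attempts', 5) - 1)
        max_prompt_revisions = max(0, event.get('max_prompt_revisions', 3) - 1)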
        output_s3_bucket_name = event['output_s3_bucket_name']
        output_s3_key_prefix = event.get('output_s3_key_prefix', 'generated-images')
        print(f"Initial prompt: {initial_prompt}")
        print(f"Max retry attempts: {max_retry_attempts}")
        print(f"Max prompt revisions: {max_prompt_revisions}")
        # Model parameters
        claude_validate_params = {
            'temperature': event.get('claude_validate_temperature', 1.0),
            'top_p': event.get('claude_validate_top_p', 0.999),
            'top_k': event.get('claude_validate_top_k', 250),
            'max_tokens': event.get('claude_validate_max_tokens', 4096)
        }
        claude_revise_params = {
            'temperature': event.get('claude_revise_temperature', 1.0),
            'top_p': event.get('claude_revise_top_p', 0.999),
            'top_k': event.get('claude_revise_top_k', 250),
            'max_tokens': event.get('claude_revise_max_tokens', 4096)
        }
        sdxl_params = {
            'cfg_scale': event.get('sdxl_cfg_scale', 7),
            'steps': event.get('sdxl_steps', 50),
            "width": event.get('sdxl_width', 1024),
            "height": event.get('sdxl_height', 1024),
            "seed": event.get('sdxl_seed', 0)
        }
        print(f"Claude validate params: {claude_validate_params}")
        print(f"Claude revise params: {claude_revise_params}")
        print(f"SDXL params: {sdxl_params}")
        # Generate start timestamp and S3 key
        start_timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
        for revision in range(max_prompt_revisions + 1):
            print(f"Starting revision {revision}")
            for attempt in range(max_retry_attempts + 1):
                print(f"Attempt {attempt} for generating image")
                
                # Generate image with SDXL
                image_data = sdxl_invoke_model(prompt, sdxl_params)
                image_key = f"{output_s3_key_prefix}-{start_timestamp}-{revision:03d}-{attempt:03d}.png"
                # Save image to S3
                save_image_to_s3(image_data, output_s3_bucket_name, image_key)
                # Validate image with Claude
                is_valid = validate_image(image_data, initial_prompt, claude_validate_params)
                if is_valid:
                    print("Valid image generated successfully")
                    return {
                        'statusCode': 200,
                        'body': json.dumps({
                            'status': 'SUCCESS',
                            'message': 'Image generated successfully',
                            'output_s3_bucket_url': f'https://s3.console.aws.amazon.com/s3/buckets/{output_s3_bucket_name}',
                            'output_s3_object_url': f'https://s3.console.aws.amazon.com/s3/object/{output_s3_bucket_name}?region={region}&prefix={image_key}'
                        })
                    }
            # If max retry attempts reached and not the last revision, revise prompt
            if revision < max_prompt_revisions:
                print("Revising prompt")
                prompt = revise_prompt(initial_prompt, claude_revise_params)
        print("Failed to generate a valid image after all attempts and revisions")
        return {
            'statusCode': 400,
            'body': json.dumps({
                'status': 'FAIL',
                'error': 'Failed to generate a valid image after all attempts and revisions'
            })
        }
    except Exception as ex:
        print(f'Exception: {ex}')
        tb = sys.exc_info()[2]
        err_message = f'Exception: {str(ex.with_traceback(tb))}'
        print(err_message)
        return {
            'statusCode': 500,
            'body': json.dumps({
                'status': 'FAIL',
                'error': err_message
            })
        }

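The source code incorporates the following design points:

- It automates the loop of image generation and validation until the requirements are met.
- It uses Claude 3.5 Sonnet both to validate the generated images and to revise the prompt.
- It uses Stable Diffusion XL for high-quality image generation.
- It includes concrete image generation best practices in the prompt revision instructions.
- It makes the image generation parameters (cfg_scale, steps, width, height, seed) configurable.
- It makes the Claude 3.5 Sonnet invocation parameters (temperature, top_p, top_k, max_tokens) adjustable.
- It automatically saves generated images to an S3 bucket and returns the URLs of the result.
- It implements error handling and log output to make troubleshooting easier.
- It exchanges structured JSON with Claude to simplify parsing of the results.
- It sets a maximum number of retries and a maximum number of prompt revisions to prevent infinite loops.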
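
The JSON-based exchange depends on the model returning nothing but JSON. If a reply happens to include a sentence before or after the JSON object, `json.loads` on the whole reply fails and the attempt is counted as a failure. The following is a sketch of a more tolerant parser; `extract_json_object` is a hypothetical helper that is not part of the code above, and it simply returns the first JSON object it can decode from the text.

```python
import json
from typing import Optional


def extract_json_object(text: str) -> Optional[dict]:
    """Return the first JSON object found in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever follows it.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None
```

A helper like this could be called in place of `json.loads` in the validation and revision steps, with a `None` result handled the same way as a parse error.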

Execution and Results

Execution example: input parameters

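{
    "prompt": "A photo with no people of a night view seen from within nature, with an aurora, the moon, and a meteor shower in the sky, a vast ocean with drift ice floating on it below, and the sun rising on the horizon.",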
    "max_retry_attempts": 5,
    "max_prompt_revisions": 5,
    "output_s3_bucket_name": "ho2k.com",
    "output_s3_key_prefix": "generated-images",
    "claude_validate_temperature": 1,
    "claude_validate_top_p": 0.999,
    "claude_validate_top_k": 250,
    "claude_validate_max_tokens": 4096,
    "claude_revise_temperature": 1,
    "claude_revise_top_p": 0.999,
    "claude_revise_top_k": 250,
    "claude_revise_max_tokens": 4096,
    "sdxl_cfg_scale": 30,
    "sdxl_steps": 150,
    "sdxl_width": 1024,
    "sdxl_height": 1024,
    "sdxl_seed": 0
}

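The input parameters for this execution example were chosen as follows:

- max_retry_attempts is set to 5 to raise the chance that image generation succeeds.
- max_prompt_revisions is set to 5 to allow more opportunities to improve the prompt when needed.
- The Claude model parameters for image validation and for prompt revision (temperature, top_p, top_k, max_tokens) are set individually.
- sdxl_cfg_scale is set to 30 for closer adherence to the prompt.
- sdxl_steps is set to 150 to improve the quality of the generated image.
- sdxl_seed is set to 0 so that a random seed is used and each generation produces a different image.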
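
To get a feel for what these limits mean in terms of cost and run time, the worst case can be worked out from the loop bounds in `lambda_handler`. The sketch below mirrors that arithmetic (each limit is reduced by one and then used as `range(n + 1)`); `worst_case_calls` is a hypothetical helper for illustration only.

```python
def worst_case_calls(max_retry_attempts: int, max_prompt_revisions: int) -> dict:
    """Upper bound on model invocations if no image ever passes validation."""
    # Same arithmetic as the handler: max(0, n - 1) followed by range(n + 1).
    attempts_per_prompt = max(0, max_retry_attempts - 1) + 1
    prompt_rounds = max(0, max_prompt_revisions - 1) + 1
    return {
        "sdxl_generations": attempts_per_prompt * prompt_rounds,
        "claude_validations": attempts_per_prompt * prompt_rounds,
        # The prompt is revised between rounds, so one fewer time than there are rounds.
        "claude_revisions": prompt_rounds - 1,
    }


print(worst_case_calls(5, 5))
# {'sdxl_generations': 25, 'claude_validations': 25, 'claude_revisions': 4}
```

Read this way, a max_prompt_revisions of 5 corresponds to the initial prompt plus up to four revised prompts, and a run that never passes validation makes 25 image generations at 150 steps each, which is worth keeping in mind when choosing the Lambda timeout.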

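Execution example: results

Generated images

The image that was finally generated and passed validation in this trial is shown below:

This image largely satisfies the prompt: "A photo with no people of a night view seen from within nature, with an aurora, the moon, and a meteor shower in the sky, a vast ocean with drift ice floating on it below, and the sun rising on the horizon." Although the meteor shower and the drift ice are rendered only weakly, the contradictory scene of the moon together with the sun on the horizon is depicted well.

In addition, a comparison with the images generated earlier (see the "List of generated images" below) confirms that the image that finally passed validation meets the instructions better than the other generated images.

Image that met the prompt requirements and passed validation

The list of images generated in this trial is shown below. Each row of images was generated from a different revised prompt.

The images generated from the initially entered Japanese prompt were far from the requirements. From the first prompt revision onward, the generated images came progressively closer to the requirements.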
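
Because the sample code names each S3 object `<prefix>-<timestamp>-<revision>-<attempt>.png`, the rows of such a list can be rebuilt from the object keys alone. The following sketch groups keys by revision number; `group_keys_by_revision` is a hypothetical helper, and the keys used to try it out would be made-up examples rather than the actual objects from this trial.

```python
import re
from collections import defaultdict

# Matches the "-YYYYmmddHHMMSS-RRR-AAA.png" suffix produced by the handler.
KEY_PATTERN = re.compile(r"-(\d{14})-(\d{3})-(\d{3})\.png$")


def group_keys_by_revision(keys):
    """Map revision number -> list of keys ordered by attempt number."""
    rows = defaultdict(list)
    for key in keys:
        match = KEY_PATTERN.search(key)
        if match:
            _, revision, attempt = match.groups()
            rows[int(revision)].append((int(attempt), key))
    # Sort revisions, and sort the keys inside each revision by attempt.
    return {rev: [k for _, k in sorted(items)] for rev, items in sorted(rows.items())}
```

Keys that do not follow the naming pattern are skipped, so the function can be pointed at a prefix that also contains unrelated objects.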

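How the prompt changed across revisions

Each row of images in the "List of generated images" above was generated from a different revised prompt. Specifically, the first row was generated from the "Revision 0" prompt below, and the last row was generated from the "Revision 3" prompt below.

Let's look at the image generation prompt at each revision.

Revision 0 (the initial prompt, originally entered in Japanese and shown here in translation)

A photo with no people of a night view seen from within nature, with an aurora, the moon, and a meteor shower in the sky, a vast ocean with drift ice floating on it below, and the sun rising on the horizon.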

Revision 1

Breathtaking night landscape, vibrant aurora borealis (aurora:1.2) dancing across the starry sky, crescent moon illuminating the scene, meteor shower streaking through the atmosphere, vast arctic ocean with floating ice floes, midnight sun peeking over the horizon, ethereal glow, long exposure photography, ultra high resolution 8k, photorealistic, highly detailed, dramatic lighting, panoramic composition, inspired by Aurora Borealis paintings of Frederic Edwin Church

Revision 2

Breathtaking Arctic night landscape, vibrant green aurora borealis dancing across the starry sky, full moon illuminating the scene, meteor shower streaking through the atmosphere, vast icy ocean with floating ice floes, midnight sun peeking over the horizon, ultra-high resolution 8K, photorealistic, highly detailed, dramatic lighting, panoramic composition, inspired by Albert Bierstadt and Frederic Edwin Church, (aurora borealis:1.3), (meteor shower:1.2), serene and majestic atmosphere

Revision 3

Breathtaking nightscape from nature's perspective, vibrant aurora borealis (aurora:1.2) dancing across the starry sky, crescent moon illuminating the scene, meteor shower streaking through the atmosphere, vast ocean with floating ice floes, sun peeking over the horizon creating a golden glow, no human presence, photorealistic, highly detailed, 8k resolution, dramatic lighting, wide-angle composition, inspired by Thomas Kinkade and Aurora HDR style, serene and awe-inspiring mood

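Compared against the "List of generated images" above, the initially entered Japanese prompt was not optimized for image generation, so the images it produced were far from the requirements. From the first revision onward, the prompts optimized by the Claude 3.5 Sonnet model produced images that were closer to the requirements.

By repeating prompt revision and image generation in this way, the process eventually produced an image that met the prompt's requirements and passed validation.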
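
One detail worth noting about the revised prompts: they use `(keyword:1.2)` emphasis, a convention from Stable Diffusion web UIs. That syntax is not part of the documented SDXL request format on Amazon Bedrock, which instead accepts a `weight` on each entry of `text_prompts`. The sketch below shows one possible way to carry the emphasis over by splitting weighted terms into separate entries. `to_text_prompts` is a hypothetical helper, and whether this improves the results was not tested in this article.

```python
import re

# Matches "(some words:1.2)" and captures the words and the numeric weight.
WEIGHTED = re.compile(r"\(([^():]+):([0-9]*\.?[0-9]+)\)")


def to_text_prompts(prompt: str) -> list:
    """Split "(term:weight)" markers out of a prompt into weighted text_prompts entries."""
    weighted = [
        {"text": m.group(1).strip(), "weight": float(m.group(2))}
        for m in WEIGHTED.finditer(prompt)
    ]
    # Remove the emphasis markers from the main text and tidy up separators.
    base = WEIGHTED.sub("", prompt)
    base = re.sub(r"\s+", " ", base)
    base = re.sub(r"\s*,(\s*,)+", ",", base).strip(" ,")
    return [{"text": base, "weight": 1.0}] + weighted
```

The returned list could be passed as the `text_prompts` value in the SDXL request body in place of the single `{"text": prompt}` entry.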