[Python] Using the Ray library
- Installation
- Running the examples
- Example code (Torch)
- Run output
- Explanation
- Example code (TensorFlow)
- Run results
Installation
Notes:
On Windows, Ray requires Python 3.7 or later; see https://docs.ray.io/en/latest/ray-overview/installation.html for details.
I am using Python 3.9.
Install it directly with pip install ray.
Libraries needed for the examples:
pyarrow
torch
tensorflow
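Before running the full examples below, a quick smoke test can confirm that Ray itself works. This is a minimal sketch using only the core task API; the function name is just an illustration:

import ray

ray.init()  # starts a local Ray instance, as the trainers below do implicitly

@ray.remote
def square(x):
    # executed in a separate Ray worker process
    return x * x

# submit four tasks in parallel and fetch the results
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]

ray.shutdown()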
Running the examples
Example code (Torch)
import torch
import torch.nn as nn

import ray
from ray import train
from ray.air import session, Checkpoint
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig

# If using GPUs, set this to True.
use_gpu = False

input_size = 1
layer_size = 15
output_size = 1
num_epochs = 3


class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.layer1 = nn.Linear(input_size, layer_size)   # input layer -> hidden layer
        self.relu = nn.ReLU()                             # activation function
        self.layer2 = nn.Linear(layer_size, output_size)  # hidden layer -> output layer

    def forward(self, input):
        return self.layer2(self.relu(self.layer1(input)))  # forward pass


def train_loop_per_worker():
    dataset_shard = session.get_dataset_shard("train")
    model = NeuralNetwork()
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    model = train.torch.prepare_model(model)

    for epoch in range(num_epochs):
        for batches in dataset_shard.iter_torch_batches(
            batch_size=32, dtypes=torch.float
        ):
            inputs, labels = torch.unsqueeze(batches["x"], 1), batches["y"]
            output = model(inputs)
            # Note: labels have shape (batch,) while output has shape (batch, 1),
            # which triggers the broadcasting UserWarning seen in the run output below.
            loss = loss_fn(output, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            print(f"epoch: {epoch}, loss: {loss.item()}")

        # report a checkpoint at the end of each epoch
        session.report(
            {},
            checkpoint=Checkpoint.from_dict(
                dict(epoch=epoch, model=model.state_dict())
            ),
        )


train_dataset = ray.data.from_items([{"x": x, "y": 2 * x + 1} for x in range(200)])

scaling_config = ScalingConfig(num_workers=3, use_gpu=use_gpu)

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=scaling_config,
    datasets={"train": train_dataset},
)
result = trainer.fit()
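trainer.fit() returns a Result object whose metrics and checkpoint fields hold what the workers reported via session.report. A minimal sketch of inspecting the dict-style checkpoint saved above (assuming the Ray AIR Checkpoint.to_dict API; note that prepare_model may have wrapped the model in DistributedDataParallel, so the state-dict keys can carry a "module." prefix):

print(result.metrics)            # the last metrics dict passed to session.report

checkpoint = result.checkpoint   # last checkpoint reported by the workers
if checkpoint is not None:
    state = checkpoint.to_dict()  # dict checkpoint created with Checkpoint.from_dict above
    print("epoch in checkpoint:", state["epoch"])
    # keys may look like "module.layer1.weight" because of the DDP wrapper
    print("model state keys:", list(state["model"].keys()))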
Run output
2023-09-06 09:56:24,749 INFO worker.py:1621 -- Started a local Ray instance.
2023-09-06 09:56:34,561 INFO tune.py:666 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949
2023-09-06 09:56:34,571 INFO tensorboardx.py:178 -- pip install "ray[tune]" to see TensorBoard files.
2023-09-06 09:56:34,572 WARNING callback.py:144 -- The TensorboardX logger cannot be instantiated because either TensorboardX or one of it's dependencies is not installed. Please make sure you have the latest version of TensorboardX installed: `pip install -U tensorboardx`
2023-09-06 09:56:34,614 INFO data_parallel_trainer.py:404 -- GPUs are detected in your Ray cluster, but GPU training is not enabled for this trainer. To enable GPU training, make sure to set `use_gpu` to True in your scaling config.
View detailed results here: c://\Users\lucia\ray_results\TorchTrainer_2023-09-06_09-56-34
Training started without custom configuration.
(TrainTrainable pid=21644) GPUs are detected in your Ray cluster, but GPU training is not enabled for this trainer. To enable GPU training, make sure to set `use_gpu` to True in your scaling config.
(TorchTrainer pid=21644) GPUs are detected in your Ray cluster, but GPU training is not enabled for this trainer. To enable GPU training, make sure to set `use_gpu` to True in your scaling config.
(TorchTrainer pid=21644) Starting distributed worker processes: ['6896 (127.0.0.1)', '15156 (127.0.0.1)', '18360 (127.0.0.1)']
(RayTrainWorker pid=6896) Setting up process group for: env:// [rank=0, world_size=3]
(RayTrainWorker pid=6896) Moving model to device: cuda:0
(SplitCoordinator pid=18160) Auto configuring locality_with_output=['675a449d1eb8045a8c0e594f8eb41650ad4aba5b5270a544e0616bd5', '675a449d1eb8045a8c0e594f8eb41650ad4aba5b5270a544e0616bd5', '675a449d1eb8045a8c0e594f8eb41650ad4aba5b5270a544e0616bd5']
(RayTrainWorker pid=6896) Wrapping provided model in DistributedDataParallel.
(SplitCoordinator pid=18160) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(3, equal=True)]
(SplitCoordinator pid=18160) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=2000000000.0), locality_with_output=['675a449d1eb8045a8c0e594f8eb41650ad4aba5b5270a544e0616bd5', '675a449d1eb8045a8c0e594f8eb41650ad4aba5b5270a544e0616bd5', '675a449d1eb8045a8c0e594f8eb41650ad4aba5b5270a544e0616bd5'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=18160) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
(pid=18160) Running 0: 0%| | 0/200 [00:00<?, ?it/s]
(RayTrainWorker pid=15156) D:\ANACONDA\anaconda3\lib\site-packages\torch\nn\modules\loss.py:528: UserWarning: Using a target size (torch.Size([32])) that is different to the input size (torch.Size([32, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
(RayTrainWorker pid=15156) return F.mse_loss(input, target, reduction=self.reduction)
(RayTrainWorker pid=15156) epoch: 0, loss: 13041.18359375
(RayTrainWorker pid=15156) D:\ANACONDA\anaconda3\lib\site-packages\torch\nn\modules\loss.py:528: UserWarning: Using a target size (torch.Size([2])) that is different to the input size (torch.Size([2, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size. [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayTrainWorker pid=15156) return F.mse_loss(input, target, reduction=self.reduction) [repeated 3x across cluster]
Training finished iteration 1 at 2023-09-06 09:58:54. Total running time: 2min 20s
+------------------------------+
| Training result              |
+------------------------------+
| time_this_iter_s     130.029 |
| time_total_s         130.029 |
| training_iteration         1 |
+------------------------------+
(TorchTrainer pid=21644) Could not upload checkpoint to c://\Users\lucia\ray_results\TorchTrainer_2023-09-06_09-56-34\TorchTrainer_9b8e5_00000_0_2023-09-06_09-56-34 even after 3 retries.Please check if the credentials expired and that the remote filesystem is supported. For large checkpoints or artifacts, consider increasing `SyncConfig(sync_timeout)` (current value: 1800 seconds).
(TorchTrainer pid=21644) Last sync command failed: Sync process failed: [WinError 32] Failed copying 'C:/Users/lucia/ray_results/TorchTrainer_2023-09-06_09-56-34/TorchTrainer_9b8e5_00000_0_2023-09-06_09-56-34/checkpoint_000000/.is_checkpoint' to '/Users/lucia/ray_results/TorchTrainer_2023-09-06_09-56-34/TorchTrainer_9b8e5_00000_0_2023-09-06_09-56-34/checkpoint_000000/.is_checkpoint'. Detail: [Windows error 32] The process cannot access the file because it is being used by another program.
(TorchTrainer pid=21644)
(TorchTrainer pid=21644) Could not upload checkpoint to c://\Users\lucia\ray_results\TorchTrainer_2023-09-06_09-56-34\TorchTrainer_9b8e5_00000_0_2023-09-06_09-56-34 even after 3 retries.Please check if the credentials expired and that the remote filesystem is supported. For large checkpoints or artifacts, consider increasing `SyncConfig(sync_timeout)` (current value: 1800 seconds).
2023-09-06 09:59:20,462 WARNING tune.py:1122 -- Trial Runner checkpointing failed: Sync process failed: GetFileInfo() yielded path 'C:/Users/lucia/ray_results/TorchTrainer_2023-09-06_09-56-34/TorchTrainer_9b8e5_00000_0_2023-09-06_09-56-34', which is outside base dir 'C:\Users\lucia\ray_results\TorchTrainer_2023-09-06_09-56-34'
Training completed after 3 iterations at 2023-09-06 09:59:20. Total running time: 2min 45s
Explanation
1. To view TensorBoard files, install:
pip install ray[tune]
Error: Could not install packages due to an OSError: [WinError 5] Access is denied: 'D:\ANACONDA\anaconda3\Lib\site-packages\google\~rotobuf\internal\_api_implementation.cp39-win_amd64.pyd'
Consider using the --user option or check the permissions.
Fix: use pip install --user ray[tune]
Result:
WARNING: Ignoring invalid distribution -rotobuf (d:\anaconda\anaconda3\lib\site-packages)
Installing collected packages: tensorboardX, pyarrow
WARNING: The script plasma_store.exe is installed in 'C:\Users\lucia\AppData\Roaming\Python\Python39\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed pyarrow-6.0.1 tensorboardX-2.6.2.2
WARNING: Ignoring invalid distribution -rotobuf (d:\anaconda\anaconda3\lib\site-packages)
WARNING: Ignoring invalid distribution -rotobuf (d:\anaconda\anaconda3\lib\site-packages)
WARNING: Ignoring invalid distribution -rotobuf (d:\anaconda\anaconda3\lib\site-packages)
2. Google TensorFlow's TensorBoard is a web server for visualizing the training process of a neural network. It can display scalar values, images, text, and so on, and is mainly used to record TensorFlow events such as the learning rate and the loss. Other deep learning frameworks lacked such a tool, which is why tensorboardX appeared: its goal is to give researchers a simple interface for logging events from PyTorch (and then visualizing them in TensorBoard). tensorboardX currently supports logging scalars, images, audio, histograms, text, embeddings, and backpropagation paths; a small usage sketch follows.
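As a quick illustration of the interface described above, here is a minimal tensorboardX sketch (the log directory name and the logged values are made up for the example):

from tensorboardX import SummaryWriter

writer = SummaryWriter("runs/demo")  # hypothetical log directory
for step in range(100):
    # log one scalar per step; view it later with: tensorboard --logdir runs
    writer.add_scalar("train/loss", 1.0 / (step + 1), step)
writer.close()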
Example code (TensorFlow)
# -*- coding: utf-8 -*-
"""
Created on Wed Sep 6 11:50:48 2023

@author: lucia
"""
import ray
import tensorflow as tf

from ray.air import session
from ray.air.integrations.keras import ReportCheckpointCallback
from ray.train.tensorflow import TensorflowTrainer
from ray.air.config import ScalingConfig

# If using GPUs, set this to True.
use_gpu = False

a = 5       # not used in this example
b = 10      # not used in this example
size = 100  # not used in this example


def build_model() -> tf.keras.Model:
    model = tf.keras.Sequential(
        [
            tf.keras.layers.InputLayer(input_shape=()),
            # Add feature dimension, expanding (batch_size,) to (batch_size, 1).
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(10),
            tf.keras.layers.Dense(1),
        ]
    )
    return model


def train_func(config: dict):
    batch_size = config.get("batch_size", 64)
    epochs = config.get("epochs", 3)

    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        # Model building/compiling need to be within `strategy.scope()`.
        multi_worker_model = build_model()
        multi_worker_model.compile(
            optimizer=tf.keras.optimizers.SGD(learning_rate=config.get("lr", 1e-3)),
            loss=tf.keras.losses.mean_squared_error,
            metrics=[tf.keras.metrics.mean_squared_error],
        )

    dataset = session.get_dataset_shard("train")

    results = []
    for _ in range(epochs):
        tf_dataset = dataset.to_tf(
            feature_columns="x", label_columns="y", batch_size=batch_size
        )
        history = multi_worker_model.fit(
            tf_dataset, callbacks=[ReportCheckpointCallback()]
        )
        results.append(history.history)
    return results


config = {"lr": 1e-3, "batch_size": 32, "epochs": 4}

train_dataset = ray.data.from_items(
    [{"x": x / 200, "y": 2 * x / 200} for x in range(200)]
)

scaling_config = ScalingConfig(num_workers=2, use_gpu=use_gpu)
trainer = TensorflowTrainer(
    train_loop_per_worker=train_func,
    train_loop_config=config,
    scaling_config=scaling_config,
    datasets={"train": train_dataset},
)
result = trainer.fit()
print(result.metrics)
Run results
1. Error: A Message class can only inherit from Message
Fix:
In Spyder, turn off the User Module Reloader under Tools -> Preferences -> Python interpreter.
Uncheck the "Enable UMR" and "Show reloaded modules list" options.
Then restart the kernel.
2. Error:
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
1. Downgrade the protobuf package to 3.20.x or lower.
2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
Solution:
Uninstall the currently installed protobuf version:
pip uninstall protobuf
Install version 3.19.0 (a quick version check is sketched at the end of this section):
pip install protobuf==3.19.0
3. Restart the kernel and rerun; if it fails again, rerun once more.
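To confirm that the protobuf downgrade from item 2 took effect, here is a minimal check:

import google.protobuf

# after the downgrade this should print 3.19.0
print(google.protobuf.__version__)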