[Python] Using the ray library

news2024/11/23 13:36:15

  • Installation
    • Running the examples
      • Example code (torch)
      • Run output
      • Explanation
      • Example code (tensorflow)
      • Run results

Installation

Notes:
On Windows, Python 3.7 or later is required; see https://docs.ray.io/en/latest/ray-overview/installation.html
I use Python 3.9.
Install directly with pip install ray.
Required libraries:
pyarrow;
torch;
tensorflow;
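
The requirements above can be checked before running the examples. The following is a minimal sketch that uses only the standard library; the helper name missing_packages is hypothetical and not part of Ray.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The post states that Python 3.7 or later is needed on Windows.
print("Python version:", ".".join(str(part) for part in sys.version_info[:3]))
print("Python >= 3.7:", sys.version_info >= (3, 7))

# Libraries listed in the installation notes.
print("Missing packages:", missing_packages(["ray", "pyarrow", "torch", "tensorflow"]))
```

An empty list on the last line means all four libraries are importable in the current environment.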

Running the examples

Example code (torch)

import torch
import torch.nn as nn

import ray
from ray import train
from ray.air import session, Checkpoint
from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig


# If using GPUs, set this to True.
use_gpu = False


input_size = 1
layer_size = 15
output_size = 1
num_epochs = 3


class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.layer1 = nn.Linear(input_size, layer_size)   # input layer -> hidden layer
        self.relu = nn.ReLU()                             # activation function
        self.layer2 = nn.Linear(layer_size, output_size)  # hidden layer -> output layer

    def forward(self, input):
        return self.layer2(self.relu(self.layer1(input)))  # forward pass


def train_loop_per_worker():
    dataset_shard = session.get_dataset_shard("train")
    model = NeuralNetwork()
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    model = train.torch.prepare_model(model)

    for epoch in range(num_epochs):
        for batches in dataset_shard.iter_torch_batches(
            batch_size=32, dtypes=torch.float
        ):
            inputs, labels = torch.unsqueeze(batches["x"], 1), batches["y"]
            output = model(inputs)
            loss = loss_fn(output, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            print(f"epoch: {epoch}, loss: {loss.item()}")

        session.report(
            {},
            checkpoint=Checkpoint.from_dict(
                dict(epoch=epoch, model=model.state_dict())
            ),
        )


train_dataset = ray.data.from_items([{"x": x, "y": 2 * x + 1} for x in range(200)])
scaling_config = ScalingConfig(num_workers=3, use_gpu=use_gpu)
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=scaling_config,
    datasets={"train": train_dataset},
)
result = trainer.fit()
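
One detail in the training loop is worth checking. The inputs are given a feature dimension with torch.unsqueeze, so the model output has shape (batch, 1), but the labels keep the shape (batch,). Under broadcasting rules the two are expanded against each other before the loss is computed. The sketch below reproduces the shape arithmetic in plain Python; broadcast_shape is a hypothetical helper written for illustration, not a torch function.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the shape that two arrays of shapes a and b broadcast to."""
    result = []
    # Compare dimensions from the last one backwards; a missing leading
    # dimension behaves like a dimension of size 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and x != 1 and y != 1:
            raise ValueError(f"shapes {a} and {b} cannot be broadcast together")
        result.append(x if y == 1 else y)
    return tuple(reversed(result))


output_shape = (32, 1)  # model(inputs) for a batch of 32
labels_shape = (32,)    # batches["y"] for the same batch

print(broadcast_shape(output_shape, labels_shape))  # (32, 32)
print(broadcast_shape((32, 1), (32, 1)))            # (32, 1)
```

With mismatched shapes the difference is taken over a 32 x 32 grid instead of 32 matching pairs. Giving the labels the same shape as the output, for example with torch.unsqueeze(batches["y"], 1), should avoid this.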

Run output

2023-09-06 09:56:24,749	INFO worker.py:1621 -- Started a local Ray instance.
## A local Ray instance has been started.

2023-09-06 09:56:34,561	INFO tune.py:666 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949
## The new output engine is used with verbosity 1. To fall back to the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. See https://github.com/ray-project/ray/issues/36949 for details.

2023-09-06 09:56:34,571	INFO tensorboardx.py:178 -- pip install "ray[tune]" to see TensorBoard files.
## Install "ray[tune]" with pip to get TensorBoard files.

2023-09-06 09:56:34,572	WARNING callback.py:144 -- The TensorboardX logger cannot be instantiated because either TensorboardX or one of it's dependencies is not installed. Please make sure you have the latest version of TensorboardX installed: `pip install -U tensorboardx`
## The TensorboardX logger could not be created because TensorboardX or one of its dependencies is not installed. Install the latest version with `pip install -U tensorboardx`.

2023-09-06 09:56:34,614	INFO data_parallel_trainer.py:404 -- GPUs are detected in your Ray cluster, but GPU training is not enabled for this trainer. To enable GPU training, make sure to set `use_gpu` to True in your scaling config.
## GPUs were detected in the Ray cluster, but GPU training is not enabled for this trainer. To enable it, set `use_gpu` to True in the scaling config.
View detailed results here: c://\Users\lucia\ray_results\TorchTrainer_2023-09-06_09-56-34

Training started without custom configuration.
## Training starts without a custom configuration.

(TrainTrainable pid=21644) GPUs are detected in your Ray cluster, but GPU training is not enabled for this trainer. To enable GPU training, make sure to set `use_gpu` to True in your scaling config.
## The same GPU notice as before, repeated by the trainable and trainer processes.
(TorchTrainer pid=21644) GPUs are detected in your Ray cluster, but GPU training is not enabled for this trainer. To enable GPU training, make sure to set `use_gpu` to True in your scaling config.
(TorchTrainer pid=21644) Starting distributed worker processes: ['6896 (127.0.0.1)', '15156 (127.0.0.1)', '18360 (127.0.0.1)']
## The distributed worker processes are started.
(RayTrainWorker pid=6896) Setting up process group for: env:// [rank=0, world_size=3]
## The process group is set up.
(RayTrainWorker pid=6896) Moving model to device: cuda:0
## The model is moved to the device.
(SplitCoordinator pid=18160) Auto configuring locality_with_output=['675a449d1eb8045a8c0e594f8eb41650ad4aba5b5270a544e0616bd5', '675a449d1eb8045a8c0e594f8eb41650ad4aba5b5270a544e0616bd5', '675a449d1eb8045a8c0e594f8eb41650ad4aba5b5270a544e0616bd5']
## locality_with_output is configured automatically.
(RayTrainWorker pid=6896) Wrapping provided model in DistributedDataParallel.
## The provided model is wrapped in DistributedDataParallel.
(SplitCoordinator pid=18160) Executing DAG InputDataBuffer[Input] -> OutputSplitter[split(3, equal=True)]
## The DAG InputDataBuffer[Input] -> OutputSplitter[split(3, equal=True)] is executed.
(SplitCoordinator pid=18160) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=2000000000.0), locality_with_output=['675a449d1eb8045a8c0e594f8eb41650ad4aba5b5270a544e0616bd5', '675a449d1eb8045a8c0e594f8eb41650ad4aba5b5270a544e0616bd5', '675a449d1eb8045a8c0e594f8eb41650ad4aba5b5270a544e0616bd5'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=18160) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
## For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`.

(pid=18160) Running 0:   0%|          | 0/200 [00:00<?, ?it/s]
## This is the progress bar for the run.
(RayTrainWorker pid=15156) D:\ANACONDA\anaconda3\lib\site-packages\torch\nn\modules\loss.py:528: UserWarning: Using a target size (torch.Size([32])) that is different to the input size (torch.Size([32, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
## The target size (torch.Size([32])) differs from the input size (torch.Size([32, 1])). Broadcasting will likely lead to incorrect results; make sure both have the same size.
(RayTrainWorker pid=15156)   return F.mse_loss(input, target, reduction=self.reduction)
(RayTrainWorker pid=15156) epoch: 0, loss: 13041.18359375
## Epoch 0: the computed loss is 13041.18359375.
(RayTrainWorker pid=15156) D:\ANACONDA\anaconda3\lib\site-packages\torch\nn\modules\loss.py:528: UserWarning: Using a target size (torch.Size([2])) that is different to the input size (torch.Size([2, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size. [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayTrainWorker pid=15156)   return F.mse_loss(input, target, reduction=self.reduction) [repeated 3x across cluster]
Training finished iteration 1 at 2023-09-06 09:58:54. Total running time: 2min 20s
## Training finished iteration 1 at 2023-09-06 09:58:54. Total running time: 2 min 20 s.
+------------------------------+
| Training result              |
+------------------------------+
| time_this_iter_s     130.029 |
| time_total_s         130.029 |
| training_iteration         1 |
+------------------------------+
(TorchTrainer pid=21644) Could not upload checkpoint to c://\Users\lucia\ray_results\TorchTrainer_2023-09-06_09-56-34\TorchTrainer_9b8e5_00000_0_2023-09-06_09-56-34 even after 3 retries.Please check if the credentials expired and that the remote filesystem is supported. For large checkpoints or artifacts, consider increasing `SyncConfig(sync_timeout)` (current value: 1800 seconds).
## The checkpoint could not be uploaded to the results directory even after 3 retries. Check whether the credentials have expired and whether the remote filesystem is supported. For large checkpoints or artifacts, consider increasing `SyncConfig(sync_timeout)` (current value: 1800 seconds).
(TorchTrainer pid=21644) Last sync command failed: Sync process failed: [WinError 32] Failed copying 'C:/Users/lucia/ray_results/TorchTrainer_2023-09-06_09-56-34/TorchTrainer_9b8e5_00000_0_2023-09-06_09-56-34/checkpoint_000000/.is_checkpoint' to '/Users/lucia/ray_results/TorchTrainer_2023-09-06_09-56-34/TorchTrainer_9b8e5_00000_0_2023-09-06_09-56-34/checkpoint_000000/.is_checkpoint'. Detail: [Windows error 32] The process cannot access the file because it is being used by another process.
## The copy failed because another program was using the file.
(TorchTrainer pid=21644) 
(TorchTrainer pid=21644) Could not upload checkpoint to c://\Users\lucia\ray_results\TorchTrainer_2023-09-06_09-56-34\TorchTrainer_9b8e5_00000_0_2023-09-06_09-56-34 even after 3 retries.Please check if the credentials expired and that the remote filesystem is supported. For large checkpoints or artifacts, consider increasing `SyncConfig(sync_timeout)` (current value: 1800 seconds).
## The same upload failure as before.
2023-09-06 09:59:20,462	WARNING tune.py:1122 -- Trial Runner checkpointing failed: Sync process failed: GetFileInfo() yielded path 'C:/Users/lucia/ray_results/TorchTrainer_2023-09-06_09-56-34/TorchTrainer_9b8e5_00000_0_2023-09-06_09-56-34', which is outside base dir 'C:\Users\lucia\ray_results\TorchTrainer_2023-09-06_09-56-34'
## GetFileInfo() returned a path that is treated as lying outside the base directory.

Training completed after 3 iterations at 2023-09-06 09:59:20. Total running time: 2min 45s
## Training completed after 3 iterations at 2023-09-06 09:59:20. Total running time: 2 min 45 s.
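
The two target sizes that appear in the warnings, torch.Size([32]) and torch.Size([2]), fit the data split reported in the log, where OutputSplitter[split(3, equal=True)] divides the 200 rows among 3 workers. The following is a minimal sketch of that arithmetic, assuming an equal split gives every worker 200 // 3 = 66 rows and drops the remainder. The helper batch_sizes is hypothetical and not a Ray API.

```python
def batch_sizes(total_rows, num_workers, batch_size):
    """Batch sizes one worker sees per epoch, assuming an equal split that drops the remainder."""
    rows_per_worker = total_rows // num_workers
    full_batches, leftover = divmod(rows_per_worker, batch_size)
    return [batch_size] * full_batches + ([leftover] if leftover else [])


# 200 rows, 3 workers, batch_size=32 as in the torch example.
print(batch_sizes(200, 3, 32))  # [32, 32, 2]
```

Under that assumption each worker iterates over two full batches of 32 and one final batch of 2 in every epoch.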

Explanation

1. To see TensorBoard files, install:
pip install ray[tune]
Error: Could not install packages due to an OSError: [WinError 5] Access is denied: 'D:\ANACONDA\anaconda3\Lib\site-packages\google\~rotobuf\internal\_api_implementation.cp39-win_amd64.pyd'
Consider using the --user option or check the permissions.
Fix: run pip install --user ray[tune]
Result:

WARNING: Ignoring invalid distribution -rotobuf (d:\anaconda\anaconda3\lib\site-packages)
Installing collected packages: tensorboardX, pyarrow
  WARNING: The script plasma_store.exe is installed in 'C:\Users\lucia\AppData\Roaming\Python\Python39\Scripts' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed pyarrow-6.0.1 tensorboardX-2.6.2.2
WARNING: Ignoring invalid distribution -rotobuf (d:\anaconda\anaconda3\lib\site-packages)
WARNING: Ignoring invalid distribution -rotobuf (d:\anaconda\anaconda3\lib\site-packages)
WARNING: Ignoring invalid distribution -rotobuf (d:\anaconda\anaconda3\lib\site-packages)

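2. TensorBoard, from Google's TensorFlow, is a web server for visualizing the training of neural networks. It displays scalars, images, text and more, and is mainly used to record TensorFlow events such as the learning rate and the loss. Other deep learning frameworks lacked such a tool, which is why tensorboardX appeared: the package gives researchers a simple interface for logging events from PyTorch, which TensorBoard then visualizes. tensorboardX currently supports logging scalars, images, audio, histograms, text, embeddings, and the backward-pass graph.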
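
As an illustration of the logging interface, the sketch below writes a few scalar values with tensorboardX's SummaryWriter and add_scalar. The tag name "train/loss", the helper log_losses and the loss values are made up for the example. The import is guarded so the sketch still runs, without writing anything, when tensorboardX is not installed.

```python
import tempfile

try:
    from tensorboardX import SummaryWriter
except ImportError:
    SummaryWriter = None  # tensorboardX missing: the sketch runs but writes nothing


def log_losses(losses, logdir):
    """Write one "train/loss" scalar per step to logdir; return the number of steps."""
    writer = SummaryWriter(logdir) if SummaryWriter else None
    for step, loss in enumerate(losses):
        if writer:
            writer.add_scalar("train/loss", loss, step)
    if writer:
        writer.close()
    return len(losses)


logdir = tempfile.mkdtemp()
# Illustrative values only, not results from the run in this post.
print(log_losses([3.0, 2.0, 1.0], logdir), "points logged to", logdir)
```

The resulting event files can then be opened by pointing TensorBoard at the log directory with tensorboard --logdir.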


Example code (tensorflow)

# -*- coding: utf-8 -*-
"""
Created on Wed Sep  6 11:50:48 2023

@author: lucia
"""
import ray
import tensorflow as tf

from ray.air import session
from ray.air.integrations.keras import ReportCheckpointCallback
from ray.train.tensorflow import TensorflowTrainer
from ray.air.config import ScalingConfig


# If using GPUs, set this to True.
use_gpu = False

a = 5
b = 10
size = 100


def build_model() -> tf.keras.Model:
    model = tf.keras.Sequential(
        [
            tf.keras.layers.InputLayer(input_shape=()),
            # Add feature dimension, expanding (batch_size,) to (batch_size, 1).
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(10),
            tf.keras.layers.Dense(1),
        ]
    )
    return model


def train_func(config: dict):
    batch_size = config.get("batch_size", 64)
    epochs = config.get("epochs", 3)

    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    with strategy.scope():
        # Model building/compiling need to be within `strategy.scope()`.
        multi_worker_model = build_model()
        multi_worker_model.compile(
            optimizer=tf.keras.optimizers.SGD(learning_rate=config.get("lr", 1e-3)),
            loss=tf.keras.losses.mean_squared_error,
            metrics=[tf.keras.metrics.mean_squared_error],
        )

    dataset = session.get_dataset_shard("train")

    results = []
    for _ in range(epochs):
        tf_dataset = dataset.to_tf(
            feature_columns="x", label_columns="y", batch_size=batch_size
        )
        history = multi_worker_model.fit(
            tf_dataset, callbacks=[ReportCheckpointCallback()]
        )
        results.append(history.history)
    return results


config = {"lr": 1e-3, "batch_size": 32, "epochs": 4}

train_dataset = ray.data.from_items(
    [{"x": x / 200, "y": 2 * x / 200} for x in range(200)]
)
scaling_config = ScalingConfig(num_workers=2, use_gpu=use_gpu)
trainer = TensorflowTrainer(
    train_loop_per_worker=train_func,
    train_loop_config=config,
    scaling_config=scaling_config,
    datasets={"train": train_dataset},
)
result = trainer.fit()
print(result.metrics)

Run results

1. Error: A Message class can only inherit from Message
Fix:
In Spyder, open Tools -> Preferences -> Python interpreter and turn off the User Module Reloader.
Uncheck both the Enable UMR and Show reloaded modules list options.
Then restart the kernel.

2.

TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

Fix:
Uninstall the installed protobuf version:
pip uninstall protobuf
Install version 3.19.0:
pip install protobuf==3.19.0

3. Restart the kernel and run again; if it fails again, rerun it.
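
The TypeError text names 3.20.x as the newest protobuf series that still works with the outdated generated code. The following is a small sketch for checking the installed version before and after the downgrade. The helper needs_downgrade is a hypothetical name, and it compares only the major and minor version numbers.

```python
from importlib import metadata


def needs_downgrade(version):
    """Return True if a protobuf version string is newer than the 3.20.x series."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) > (3, 20)


try:
    installed = metadata.version("protobuf")
    print(installed, "-> downgrade needed:", needs_downgrade(installed))
except metadata.PackageNotFoundError:
    print("protobuf is not installed")
```

After pip install protobuf==3.19.0 the check should report that no downgrade is needed.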
