ARM Basics: RAS

news2025/1/11 2:36:49

Reliability, Availability, and Serviceability (RAS), for A-profile architecture
Source: https://developer.arm.com/documentation/102105/latest/

1 Introduction to RAS

1.1 Faults, errors, and failures

The three concepts are distinguished as follows:

A failure is the event of deviation from correct service. This includes data corruption, data loss, and service loss.
An error is the deviation from correct service. An incorrect value that has an error is corrupt.
A fault is the cause of the error.

There are many sources of faults in a system, including both software and hardware faults:
• Hardware faults originate in, or affect, hardware.
• Software faults affect software, that is programs or data.

The RAS Extension and RAS System Architecture primarily address errors produced from hardware faults. These fall into two main areas:
• 1. Transient faults.
• 2. Non-transient or persistent faults.

1.2 General taxonomy of errors

1.2.1 Error detection
When a component accesses memory or other state, an error might be detected in that memory or state.
The error might be corrected or deferred by the component, or signaled to another component as either a deferred error or a detected error.

1.2.2 Error propagation
An error is propagated by deviations from correct service, including when any of the following occurs that would not have been permitted to occur had the fault not been activated:
1) An error is propagated in the following scenarios:

• 1. A corrupt value is passed from producer to consumer.
In other words, bad data moves downstream from producer to consumer.

• 2. A transaction or other operation occurs that should not have occurred.
Something happened that should not have happened.

• 3. A transaction or other operation that should have occurred does not occur.
Something that should have happened did not happen.

• 4. A loss of uniprocessor semantics or any other loss of coherency in a multiprocessor coherent system is observed.
Coherency is lost in a multiprocessor system.

• 5. Changing the timing and/or order of transactions or other operations such that the timing and/or order of those transactions or operations is incorrect. In this case, the service interface defines acceptable timings and/or orders for transactions and other operations.
The timing or the order of transactions has changed.

An error is silently propagated by the producer of a transaction if the consumer of the transaction cannot detect the error and consumes an undetected error because of the transaction. This might be because of one of the following:
2) An error is silently propagated by the producer for one of these reasons:
• 1. The error is present on the transaction, but was not detected by the producer. The error is silently propagated by the producer.
The error is in the transaction, but the producer never detected it, so the producer passes it on silently.

• 2. The error is present on the transaction, but was not signaled to the consumer as an error. For example, a corrupt value was passed in the transaction with no indication that it was corrupt. The error is silently propagated by the producer.
The error is in the transaction, but it is not flagged to the consumer as an error, for instance when a corrupt value is passed with no indication that it is corrupt.

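The difference between the two: in the first case the producer itself cannot detect the error, so the error propagates; in the second case the producer does not mark the error for the consumer, so the error propagates.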

Errors might be propagated by components in a system until one of the following occurs:
3) Components in the system may keep propagating an error until one of the following occurs:
• They are masked and do not affect the outcome of the system.
The error might be masked because a corrupt value is discarded or overwritten, or the error is detected and removed.
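Masked errors do not affect the outcome of the system: the corrupt value may be discarded or overwritten, or the error may be detected and removed.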

• They affect the service interface of the system and possibly cause failure. If the error has been silently propagated to the service interface then:
– This is a Silent Data Corruption (SDC).
– The rate of such failures, measured as the number of failures per billion device-hours of operation, is called the SDC Failure-in-Time (FIT) rate.
Alternatively, the error might have been detected, causing the system to invoke error handling and recovery.
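The error reaches the service interface of the system and may cause a failure. If it reached the service interface silently:
– It is a Silent Data Corruption (SDC).
– The failure rate, counted as failures per billion device-hours of operation, is the SDC Failure-in-Time (FIT) rate.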
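
As a worked illustration of the FIT definition above (failures per billion device-hours), the arithmetic can be sketched as follows. The function name `fit_rate` and the fleet numbers are hypothetical and exist only for this example; they are not taken from the Arm specification.

```python
HOURS_PER_BILLION = 1_000_000_000  # FIT is normalized to 10^9 device-hours


def fit_rate(failures: int, devices: int, hours_per_device: float) -> float:
    """Return failures per billion device-hours of operation."""
    device_hours = devices * hours_per_device
    if device_hours <= 0:
        raise ValueError("device-hours must be positive")
    return failures * HOURS_PER_BILLION / device_hours


# Hypothetical fleet: 10,000 devices, each observed for 8,760 hours (one year),
# with 2 silent data corruptions found in total.
sdc_fit = fit_rate(failures=2, devices=10_000, hours_per_device=8_760)
print(round(sdc_fit, 2))  # 22.83
```

The same function gives a DUE FIT rate when it is fed the count of detected uncorrected errors instead of silent data corruptions.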

1.2.3 Infected and poisoned

The state of a component becomes infected when the component consumes an uncorrected error that updates
the state.
A component's state becomes infected once the component consumes an uncorrected error that updates that state.

A value is poisoned in the state of a component if it is marked as being in error, such that a subsequent access of
the state will detect the value is so marked and is treated as a detected error.
A poisoned value is one that is marked as being in error, so any later access to that state sees the mark and treats it as a detected error.

Poison is used to defer an error.
Poison is a way of deferring an error.

1.2.4 Containable and uncontainable

An undetected error is uncontained at the component that failed to detect it.
If a component misses an error, the error is uncontained at that component.

A silently propagated error is uncontained at the component that silently propagated it.
The same holds at a component that silently propagates an error.

A detected uncorrected error is uncontainable at the component if it might be uncontained at the component.
A detected uncorrected error counts as uncontainable at the component only if it might be uncontained there.

A detected uncorrected error is containable at the component if it is not uncontainable at the component. If
the component cannot determine whether a detected uncorrected error is uncontainable or containable at the
component, then the component treats the detected uncorrected error as uncontainable at the component.

An error that is uncontainable at a component might be containable at the system level.
An error that a component cannot contain may still be containable at the system level.

Note:
Reporting an error as containable allows software to contain the error. This does not mean that hardware has
contained the error.
Reporting an error as containable lets software contain it; it does not mean that hardware has already done so.
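
The containability rules in this section reduce to a small decision, including the conservative default for the case where the component cannot tell. The sketch below is only an illustration; `Containment` and `classify_detected_uncorrected` are hypothetical names, and `None` is an assumed encoding for "cannot determine".

```python
from enum import Enum
from typing import Optional


class Containment(Enum):
    CONTAINABLE = "containable"
    UNCONTAINABLE = "uncontainable"


def classify_detected_uncorrected(might_be_uncontained: Optional[bool]) -> Containment:
    """Classify a detected uncorrected error at one component.

    might_be_uncontained is True or False when the component can tell,
    and None when it cannot determine the answer.
    """
    # Unknown is treated the same as "might be uncontained" (the conservative rule).
    if might_be_uncontained is None or might_be_uncontained:
        return Containment.UNCONTAINABLE
    return Containment.CONTAINABLE


print(classify_detected_uncorrected(None).value)  # uncontainable
```

Only an explicit "not uncontained" answer yields `CONTAINABLE`, which mirrors the rule that a component in doubt treats the error as uncontainable.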

1.3 Techniques for improving reliability, availability, and serviceability

1.3.1 Fault prevention and fault removal
Fault prevention and fault removal are two techniques for handling faults. Fault prevention and fault removal
mechanisms are IMPLEMENTATION DEFINED.

Fault prevention techniques are outside the scope of the architecture.
The architecture does not cover fault prevention.

A fault that is removed is a corrected error and might be recorded and generate a fault handling interrupt, but it
is not propagated. This means that it is not consumed and does not cause service failure.
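Fault removal example: a corrected error may be recorded and may raise a fault handling interrupt, but it is not propagated, so it is never consumed and causes no service failure.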

A common technique to detect and correct errors is the use of an Error Detection and Correction Code (EDAC),
more commonly referred to as simply an Error Correction Code (ECC). ECC schemes use mathematical codes
to detect and correct an error in a value in memory. The size of the value is the protection granule for the ECC
scheme.
ECC (more formally EDAC) uses mathematical codes to detect and correct an error in a value in memory; the size of the protected value is the protection granule of the ECC scheme.
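
To make the idea of an error-correcting code concrete, here is a minimal sketch using the classic Hamming(7,4) code, where the protection granule is 4 bits. This code is chosen purely for brevity and is an assumption of this example: the Arm specification does not mandate any particular ECC scheme, and the function names are hypothetical.

```python
from typing import List, Tuple


def hamming74_encode(nibble: int) -> List[int]:
    """Encode a 4-bit value into 7 bits (codeword positions 1..7)."""
    if not 0 <= nibble <= 0xF:
        raise ValueError("nibble must fit in 4 bits")
    code = [0] * 8  # index 0 is unused so indices match positions 1..7
    # Data bits live at positions 3, 5, 6, 7 (least significant bit first).
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    # Parity bits at positions 1, 2, 4 each cover the positions sharing that bit.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def hamming74_decode(bits: List[int]) -> Tuple[int, bool]:
    """Return (value, corrected); corrects at most one flipped bit."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    corrected = syndrome != 0
    if corrected:
        code[syndrome] ^= 1  # the syndrome is the position of the flipped bit
    value = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return value, corrected


codeword = hamming74_encode(0b1011)
codeword[4] ^= 1  # flip one stored bit to model a transient fault
value, corrected = hamming74_decode(codeword)
print(value == 0b1011, corrected)  # True True
```

A plain Hamming(7,4) code miscorrects when two bits flip in the same granule, which is one way an uncorrected or silently propagated error can arise; practical memories usually add an extra parity bit so that double-bit errors are at least detected.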

The RAS Extension and RAS System Architecture do not require implementation of any fault removal schemes,
including ECC.
Neither specification forces an implementation to provide fault removal, ECC included.

1.3.2 Error handling and recovery
A fault that is not removed gives rise to an uncorrected error.
A fault that is not removed leads to an uncorrected error (for example, a 1-bit ECC error that accumulates into a 2-bit ECC error).

Error recovery is the process by which software and hardware minimize the impact of an uncorrected error.
Error recovery is how software and hardware limit the impact of an uncorrected error.

Error recovery methods include:
• Deferring an error from a fault. An error is deferred by hardware if hardware can make forward progress
without consuming the error. Deferring the error means:

– The fault might become masked later (fault removal). For example, because the corrupt value is
overwritten before it is consumed.
The fault may be masked later, for instance when the corrupt value is overwritten before anything consumes it.

– If the deferred error is later consumed, then the error is reported at the point of consumption. For
example, if the deferred error is consumed by a Processing element (PE) then the consumer PE
generates an error exception. This can give better results in terms of error recovery in the case where
the original producer of the data is not known when the error was deferred. For example because a
latent error was detected.
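If the deferred error is consumed later, it is reported at the point of consumption.
For example, a Processing Element (PE) that consumes the deferred error generates an error exception.
This helps error recovery when the original producer of the data is unknown at the time of deferral, for example because a latent error was detected.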

A common technique to defer an error is to replace the corrupt value with a poisoned value, for example in
memory or in a transaction.
Poisoning the corrupt value, in memory or in a transaction, is the usual way to defer an error.

• Preventing further propagation of the error, that is containing the error. In particular, preventing silent
propagation of the error.
That is, stop the error from spreading further (contain it), and above all stop it from spreading silently.

• Reducing the severity of a failure by invoking a service failure mode:
– This is a Detected Uncorrected Error (DUE).
– The rate of such failures gives the DUE FIT rate.
– The type of service failure mode depends on what is acceptable to the service.
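
The deferral-by-poison method in the list above can be modeled with a toy memory that keeps one poison flag per location. This is a sketch under stated assumptions: `PoisonableMemory`, `defer_error`, and `PoisonConsumedError` are hypothetical names, and the Python exception merely stands in for the error exception that a consuming PE would generate.

```python
class PoisonConsumedError(Exception):
    """Stands in for the error exception raised at the point of consumption."""


class PoisonableMemory:
    def __init__(self, size: int) -> None:
        self._data = [0] * size
        self._poison = [False] * size

    def write(self, addr: int, value: int) -> None:
        # A clean overwrite masks any error previously deferred at this location.
        self._data[addr] = value
        self._poison[addr] = False

    def defer_error(self, addr: int) -> None:
        # An uncorrectable error was detected here, but nothing needs the value yet,
        # so the location is marked as poisoned instead of being reported now.
        self._poison[addr] = True

    def read(self, addr: int) -> int:
        # Consuming a poisoned value is reported as a detected error.
        if self._poison[addr]:
            raise PoisonConsumedError(f"poisoned value consumed at address {addr}")
        return self._data[addr]


mem = PoisonableMemory(4)
mem.defer_error(0)   # error deferred: no exception yet
mem.write(0, 9)      # overwritten before consumption, so the fault is masked
print(mem.read(0))   # 9
```

Had `read(0)` been called before the overwrite, the model would raise `PoisonConsumedError`, which corresponds to reporting the error at the point of consumption.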

A software error recovery agent is typically invoked when hardware detects an error it cannot correct, defer, or
remove.
Software error recovery usually starts when hardware detects an error that it cannot correct, defer, or remove.

An error recovery agent also provides information to the operator through error logs to improve serviceability,
for example to help with the identification of a Field Replaceable Unit (FRU).
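The error recovery agent also writes error logs for the operator, which improves serviceability, for example by helping to identify the Field Replaceable Unit (FRU).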

The RAS Extension and RAS System Architecture provide optional common programmers’ models to record
information about an error in an error record.
Both provide optional common programmers' models for recording information about an error in an error record.

The RAS Extension describes the behavior of a PE when an error is signaled to it by the system, including
invoking a service failure mode by taking an error exception, and optional mechanisms to limit propagation of
an error.
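The RAS Extension describes how a PE behaves when the system signals an error to it: the PE can invoke a service failure mode by taking an error exception, and there are optional mechanisms to limit propagation of the error.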

The RAS Extension and RAS System Architecture do not require systems to implement error recovery
mechanisms, including poison, and do not require systems to limit the silent propagation of errors.
Neither requires systems to implement error recovery mechanisms (poison included) or to limit the silent propagation of errors.

1.3.3 Fault handling
Fault handling by software is the process by which software diagnoses and responds to faults to improve
availability.
Software fault handling means diagnosing faults and responding to them in order to improve availability.

Fault handling methods include:

• 1. Predictive Failure Analysis (PFA), using information recorded by hardware to trigger pre-emptive action.
PFA uses information recorded by hardware to trigger pre-emptive action.

The RAS Extension and RAS System Architecture provide optional mechanisms to allow the reporting of errors
and warnings to a fault handling agent, and to record information about the fault in an error record. It is the
responsibility of the error recovery and fault handling processes to collate the error record data and write it to an
error log.
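Both provide optional mechanisms for reporting errors and warnings to a fault handling agent and for recording information about the fault in an error record. Collating the error record data and writing it to an error log is the job of the error recovery and fault handling processes.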

The detailed nature of the fault handling agent is outside the scope of this architecture. Fault handling and error
recovery might be independent agents.
The details of the fault handling agent are out of scope for this architecture; fault handling and error recovery may be separate agents.
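
One possible shape for Predictive Failure Analysis, as described in this section, is a software agent that logs corrected errors and flags a unit once a count crosses a threshold. Everything in the sketch is an assumption made for illustration: the class `CorrectedErrorMonitor`, the threshold policy, and the FRU labels are hypothetical, and the architecture leaves the nature of the fault handling agent undefined.

```python
from collections import Counter
from typing import List, Tuple


class CorrectedErrorMonitor:
    """Toy fault handling agent: logs corrected errors and applies a threshold."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.counts: Counter = Counter()
        self.error_log: List[Tuple[str, str]] = []

    def record(self, fru: str, detail: str) -> bool:
        """Log one corrected error; return True when the FRU should be flagged."""
        self.error_log.append((fru, detail))  # collate the record into the log
        self.counts[fru] += 1
        return self.counts[fru] >= self.threshold


monitor = CorrectedErrorMonitor(threshold=3)
for _ in range(2):
    monitor.record("DIMM0", "corrected single-bit error")
print(monitor.record("DIMM0", "corrected single-bit error"))  # True
```

A real agent would likely weigh error rates over time rather than raw counts; the fixed threshold here only shows where a pre-emptive action could be triggered.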

2 RAS Extension for A-profile

2.1 PE error handling

2.1.1 PE error detection
When a PE accesses memory or other state, an error might be detected in that memory or state, and corrected,
deferred, or signaled to the PE as a detected error with an in-band error response.
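When a PE accesses memory or other state, an error may be detected there and then corrected, deferred, or signaled to the PE as a detected error with an in-band error response.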

When an error is detected by a component on a read or a cache maintenance operation from the PE:
1) When a component detects an error on a read or cache maintenance operation from the PE:

– If the error can be corrected, it is corrected and corrected data is returned.
A correctable error is fixed, and the fixed data is what the PE receives.

– If the error cannot be corrected and can be deferred, it is deferred. For example, on a load by poisoning
the PE state, if this is supported by the PE implementation.
An uncorrectable error that can be deferred is deferred, for example on a load by poisoning the PE state, where the PE implementation supports this.

– If the error cannot be corrected and if implemented and enabled at the component, the detected error
is signaled to the PE as an in-band error response.
An uncorrectable error is signaled to the PE as an in-band error response, provided that this signaling is implemented and enabled at the component.

When an error is detected by a component consuming a write from the PE:
2) When a component that consumes a write from the PE detects an error:

– If the error can be corrected, it is corrected.
A correctable error is simply corrected.

– If the error cannot be corrected and can be deferred, it is deferred to the consumer. For example, by
poisoning the location being written.
An uncorrectable error that can be deferred is deferred to the consumer, for example by poisoning the location being written.

– If the error cannot be corrected and if implemented and enabled at the component, the detected error
is signaled to the PE as an in-band error response.
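An uncorrectable error is signaled to the PE as an in-band error response, provided that this signaling is implemented and enabled at the component.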
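
The three-way choice described above for both reads and writes (correct, defer, or signal in-band) can be written as a small decision function. This is a sketch that assumes the bullets are evaluated in the order listed; `Outcome` and `handle_detected_error` are hypothetical names, and `NOT_SIGNALED` is a placeholder for the fall-through case that the quoted text does not describe.

```python
from enum import Enum


class Outcome(Enum):
    CORRECTED = "corrected"
    DEFERRED = "deferred"
    SIGNALED_IN_BAND = "in-band error response"
    NOT_SIGNALED = "not signaled"  # placeholder: not covered by the text above


def handle_detected_error(correctable: bool, deferrable: bool,
                          in_band_enabled: bool) -> Outcome:
    """Pick the component's response to an error detected on a PE access."""
    if correctable:
        return Outcome.CORRECTED          # corrected (data is returned on a read)
    if deferrable:
        return Outcome.DEFERRED           # for example by poisoning
    if in_band_enabled:
        return Outcome.SIGNALED_IN_BAND   # detected error reported to the PE
    return Outcome.NOT_SIGNALED


print(handle_detected_error(False, True, True).value)  # deferred
```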

2.1.2 PE error propagation
The program-visible architectural state of the PE, referred to as the PE state, includes:
• General-purpose, SIMD&FP, and SVE registers.
• System registers.
• Special-purpose registers.
• PSTATE.

An error is consumed by the PE by any of the following:
1) The PE consumes an error in any of the following cases:

• 1. An instruction commits the corruption into the PE state.
A committed instruction writes the corruption into the PE state.

• 2. The error is on an instruction fetch and the corrupt instruction is committed for execution.
A corrupt instruction is fetched and then committed for execution.

• 3. The error is on a translation table walk for a committed load, store, or instruction fetch.
The error is hit during a translation table walk made for a committed load, store, or instruction fetch.

An error is propagated by the PE by one or more of the following occurring that would not have been permitted
to occur had the fault not been activated:
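2) The PE propagates an error through one or more of the following events, none of which would have been permitted had the fault not been activated: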

• Consumption of the corrupt value by any instruction, propagating the error to the target(s) of the instruction.
This includes:
Any instruction that consumes the corrupt value propagates the error to the instruction's target(s). This includes:

– A store of a corrupt value.
Storing a corrupt value.

– A write of a corrupt value to a System register, Special-purpose register, or PSTATE. Infecting a
System register state might mean that the PE generates transactions that would not otherwise be
permitted.
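Writing a corrupt value to a System register, Special-purpose register, or PSTATE. Infected System register state may make the PE generate transactions that would otherwise not be permitted.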

• Any operation occurring that should not have occurred, including:
Any operation that should not have happened, including:

– 1. A load, translation table walk, or instruction fetch that would not have been permitted, including those
from hardware speculation or prefetching.
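A load, translation table walk, or instruction fetch that is not permitted, including ones issued by hardware speculation or prefetching.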

– 2. A store to an incorrect address, or a store that would not have been made or not permitted.
A store to the wrong address, or a store that would not have been made or is not permitted.

– 3. A direct or indirect write to a Special-purpose or System register that would not have been made or
not permitted.
A direct or indirect write to a Special-purpose or System register that would not have been made or is not permitted.

– 4. Assertion of any signal, such as an interrupt, that would not have been asserted.
Asserting a signal, such as an interrupt, that would not otherwise have been asserted.

• Any operation not occurring that should have occurred.
An operation that should have happened does not happen.

• Causing the PE to take an imprecise exception, other than an error exception in response to the error itself.
See the section Definition of a precise exception in the Arm® Architecture Reference Manual, for A-profile
architecture.
The PE takes an imprecise exception that is not the error exception raised in response to the error itself.

• The PE discarding data that it holds in a modified state.
The PE drops data that it holds in a modified state.

• Any other loss of required uniprocessor semantics, ordering, or coherency.
This is the catch-all for any remaining loss of the required uniprocessor semantics, ordering, or coherency.

An error propagated by the PE is silently propagated by the PE only if all of the following are true:
An error propagated by the PE counts as silently propagated only when all of the following conditions hold:

  1. The propagation is not part of the required operation of the PE in taking an error exception generated by
    the error.
    The propagation is not something the PE must do as part of taking an error exception generated by the error.

  2. The propagation is not part of the required operation of the PE executing an ESB instruction that
    synchronizes the error.
    The propagation is not something the PE must do as part of executing an ESB instruction that synchronizes the error.

  3. The error is not signaled to the consumer as a detected error or deferred error.
    The error is not signaled to the consumer as a detected error or a deferred error.

  4. Any of the following are true:
    • The corrupt value is held in other than the general-purpose, SIMD&FP, or SVE registers.
    The corrupt value is held somewhere other than the general-purpose, SIMD&FP, or SVE registers.

    • The error is propagated by an instruction in program order before either taking an error exception
    generated by the error or executing an ESB instruction that synchronizes the error, and is propagated
    to outside of the general-purpose, SIMD&FP, or SVE registers.
    An instruction earlier in program order than the error exception, or than the ESB instruction that synchronizes the error, propagates the error out of the general-purpose, SIMD&FP, or SVE registers.

    • The error is propagated other than by an instruction that consumes the corrupt value as an input
    operand but otherwise behaves correctly.
    The error propagates in some way other than through an instruction that takes the corrupt value as an input operand and otherwise behaves correctly.
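
The "all of conditions 1 to 3, plus any one of the sub-conditions under 4" structure above is easy to misread, so it may help to see it as a boolean predicate. The sketch below is an illustration only; the `Propagation` fields are hypothetical names that each encode one condition from the list.

```python
from dataclasses import dataclass


@dataclass
class Propagation:
    part_of_error_exception_entry: bool      # condition 1 is the negation of this
    part_of_esb_synchronization: bool        # condition 2 is the negation of this
    signaled_to_consumer: bool               # condition 3 is the negation of this
    held_outside_gp_simd_sve: bool           # condition 4, first bullet
    escaped_registers_before_sync: bool      # condition 4, second bullet
    not_via_plain_operand_consumption: bool  # condition 4, third bullet


def is_silently_propagated(p: Propagation) -> bool:
    """True only if conditions 1, 2, and 3 hold and at least one bullet of 4 holds."""
    any_of_condition_4 = (
        p.held_outside_gp_simd_sve
        or p.escaped_registers_before_sync
        or p.not_via_plain_operand_consumption
    )
    return (
        not p.part_of_error_exception_entry
        and not p.part_of_esb_synchronization
        and not p.signaled_to_consumer
        and any_of_condition_4
    )


example = Propagation(False, False, False, True, False, False)
print(is_silently_propagated(example))  # True
```

Read this way, an error that stays inside the general-purpose, SIMD&FP, or SVE registers and only flows through otherwise well-behaved instructions is propagated but not silently propagated, because no bullet of condition 4 holds.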

2.1.3 Other errors (note, 2024.03.17: continue from here next week)

2.2 Generating error exceptions

2.3 Taking error exceptions

2.4 Error synchronization event

2.5 Virtual SError interrupts

2.6 Error records in the PE

3 RAS System Architecture

3.1 Nodes

3.2 Detecting and consuming errors

3.3 Standard error record

3.4 Error recovery interrupt

3.5 Fault handling interrupt

3.6 In-band error response signaling (external aborts)

3.7 Critical error interrupt

3.8 Standard format Corrected error counter

3.9 Error recovery, fault handling, and critical error signaling

3.10 Error recovery reset

3.11 Timestamp extension

3.12 Common Fault Injection Model Extension

4 RAS Extension and RAS System Architecture Registers
