Contents
- PCIe Error
- AER (Advanced Error Reporting)
- DPC (Downstream Port Containment)
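Errors on a processor can generally be divided into detected and undetected errors. An undetected error may turn out to be benign, or it may cause a system failure such as silent data corruption (SDC). Detected errors are further divided into correctable errors (CE) and uncorrectable errors (UCE).
PCIe defines two error reporting paradigms: the baseline capability and the Advanced Error Reporting (AER) capability. This article describes how PCIe defines errors and how the AER and DPC features work.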
PCIe Error
PCIe classifies errors as uncorrectable or correctable. Uncorrectable errors are further classified by severity as fatal or non-fatal.
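There are three error signaling mechanisms:
- Completion (Cpl) status
- In-band error message: routed to the Root Complex (RC), with the Requester ID identifying the BDF of the device that reports the error. A Cpl usually returns UR/CA to indicate an uncorrectable error, AER sends an error message, and a Root Port (RP) that supports AER records it in the Root Error Status register.
- Error forwarding (data poisoning): indicated by the EP bit in the TLP header.
Reporting of non-fatal and fatal errors is enabled by Command.SERR# Enable, and error message generation is also controlled by Device Control bits [3:0].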
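The gating of error messages by these enable bits can be sketched as follows. This is a minimal model, not controller code: the bit values match Linux `include/uapi/linux/pci_regs.h`, while `enum err_msg` and `err_msg_enabled()` are hypothetical names introduced for illustration. It assumes the rule that ERR_NONFATAL and ERR_FATAL are sent when either SERR# Enable or the matching Device Control bit is set, and it ignores AER masks and the special handling of Unsupported Request.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values as defined in Linux include/uapi/linux/pci_regs.h. */
#define PCI_COMMAND_SERR      0x0100 /* Command: SERR# Enable */
#define PCI_EXP_DEVCTL_CERE   0x0001 /* Correctable Error Reporting Enable */
#define PCI_EXP_DEVCTL_NFERE  0x0002 /* Non-Fatal Error Reporting Enable */
#define PCI_EXP_DEVCTL_FERE   0x0004 /* Fatal Error Reporting Enable */
#define PCI_EXP_DEVCTL_URRE   0x0008 /* Unsupported Request Reporting Enable */

/* Hypothetical message type for this sketch. */
enum err_msg { MSG_ERR_COR, MSG_ERR_NONFATAL, MSG_ERR_FATAL };

/* Returns true if the function may send the given error message. */
static bool err_msg_enabled(enum err_msg msg, uint16_t command, uint16_t devctl)
{
    bool serr = (command & PCI_COMMAND_SERR) != 0;

    switch (msg) {
    case MSG_ERR_COR:
        /* ERR_COR is gated only by the Device Control bit. */
        return (devctl & PCI_EXP_DEVCTL_CERE) != 0;
    case MSG_ERR_NONFATAL:
        return serr || (devctl & PCI_EXP_DEVCTL_NFERE) != 0;
    case MSG_ERR_FATAL:
        return serr || (devctl & PCI_EXP_DEVCTL_FERE) != 0;
    }
    return false;
}
```

For example, with only SERR# Enable set, `err_msg_enabled(MSG_ERR_FATAL, PCI_COMMAND_SERR, 0)` is true while the same call for `MSG_ERR_COR` is false.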
AER (Advanced Error Reporting)
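The PCIe AER extended capability structure provides more robust error reporting. Each uncorrectable error can be programmed as fatal or non-fatal: if its severity bit is set, the error is fatal; otherwise it is non-fatal. Only the most significant error is reported, according to a precedence list.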
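As a rough illustration of the severity rule, the sketch below splits the logged uncorrectable errors into fatal and non-fatal sets using the status, mask, and severity register values. The bit definitions match Linux `include/uapi/linux/pci_regs.h`; `struct aer_uncor_regs` and the two helper functions are hypothetical and exist only for this example. The precedence list is not modeled here.

```c
#include <stdint.h>

/* Uncorrectable error bits, as in Linux include/uapi/linux/pci_regs.h. */
#define PCI_ERR_UNC_POISON_TLP 0x00001000 /* Poisoned TLP */
#define PCI_ERR_UNC_COMP_TIME  0x00004000 /* Completion Timeout */
#define PCI_ERR_UNC_MALF_TLP   0x00040000 /* Malformed TLP */
#define PCI_ERR_UNC_UNSUP      0x00100000 /* Unsupported Request */

/* Hypothetical snapshot of the three AER uncorrectable registers. */
struct aer_uncor_regs {
    uint32_t status;   /* Uncorrectable Error Status */
    uint32_t mask;     /* Uncorrectable Error Mask */
    uint32_t severity; /* Uncorrectable Error Severity */
};

/* Unmasked errors whose severity bit is set are reported as fatal. */
static uint32_t aer_fatal_bits(const struct aer_uncor_regs *r)
{
    return r->status & ~r->mask & r->severity;
}

/* Unmasked errors whose severity bit is clear are reported as non-fatal. */
static uint32_t aer_nonfatal_bits(const struct aer_uncor_regs *r)
{
    return r->status & ~r->mask & ~r->severity;
}
```

With Malformed TLP marked as fatal and Unsupported Request left at non-fatal, a status containing both bits yields one bit in each set; a masked bit appears in neither.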
After the controller receives a TLP and detects an error in it, it performs the following steps:
- Discard the TLP
- Generate a CA/UR Cpl (for NP requests)
- Set status in the PCI-compatible Status register
- Set status in the AER registers (when AER is enabled)
- Generate an error MSG (USP only)
- For malformed TLPs, return credit based on the buffer space that the TLP has consumed
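The steps above can be modeled as a set of action flags. This is only a sketch of the list as written, not the logic of any particular controller: `enum rx_err_action`, `struct rx_err_ctx`, and `rx_error_actions()` are all hypothetical names, and the conditions simply mirror the qualifiers in the list (NP only, AER enabled, USP only, malformed TLP).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical action flags, one per step in the list. */
enum rx_err_action {
    ACT_DISCARD_TLP    = 1u << 0,
    ACT_SEND_ERR_CPL   = 1u << 1, /* CA/UR completion */
    ACT_SET_PCI_STATUS = 1u << 2, /* PCI-compatible Status register */
    ACT_SET_AER_STATUS = 1u << 3,
    ACT_SEND_ERR_MSG   = 1u << 4,
    ACT_RETURN_CREDIT  = 1u << 5,
};

/* Hypothetical description of the failing TLP and the receiving port. */
struct rx_err_ctx {
    bool non_posted;  /* the TLP is a non-posted request */
    bool aer_enabled; /* AER is enabled on this function */
    bool is_usp;      /* the receiver is an upstream port */
    bool malformed;   /* the error is a malformed TLP */
};

/* Returns the set of actions the receiver takes for this error. */
static uint32_t rx_error_actions(const struct rx_err_ctx *c)
{
    uint32_t acts = ACT_DISCARD_TLP | ACT_SET_PCI_STATUS;

    if (c->non_posted)
        acts |= ACT_SEND_ERR_CPL;
    if (c->aer_enabled)
        acts |= ACT_SET_AER_STATUS;
    if (c->is_usp)
        acts |= ACT_SEND_ERR_MSG;
    if (c->malformed)
        acts |= ACT_RETURN_CREDIT;
    return acts;
}
```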
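AER initialization requires enabling a set of register fields.
The AER driver collects comprehensive error information when an error occurs, reports the error, and performs error recovery.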
static irqreturn_t aer_irq(int irq, void *context)
{
    struct pcie_device *pdev = (struct pcie_device *)context;
    struct aer_rpc *rpc = get_service_data(pdev);
    struct pci_dev *rp = rpc->rpd;
    int aer = rp->aer_cap;
    struct aer_err_source e_src = {};

    /* Read Root Error Status; nothing to do if no error is logged. */
    pci_read_config_dword(rp, aer + PCI_ERR_ROOT_STATUS, &e_src.status);
    if (!(e_src.status & AER_ERR_STATUS_MASK))
        return IRQ_NONE;

    /* Capture the error source ID, then write the status back to clear it. */
    pci_read_config_dword(rp, aer + PCI_ERR_ROOT_ERR_SRC, &e_src.id);
    pci_write_config_dword(rp, aer + PCI_ERR_ROOT_STATUS, e_src.status);

    /* Queue the error for the threaded handler; if the FIFO is full, drop it. */
    if (!kfifo_put(&rpc->aer_fifo, e_src))
        return IRQ_HANDLED;

    return IRQ_WAKE_THREAD;
}
aer_isr() then reads the status and mask registers according to severity, and prints the status, ID, and related information.
static irqreturn_t aer_isr(int irq, void *context)
{
    struct pcie_device *dev = (struct pcie_device *)context;
    struct aer_rpc *rpc = get_service_data(dev);
    struct aer_err_source e_src;

    if (kfifo_is_empty(&rpc->aer_fifo))
        return IRQ_NONE;

    while (kfifo_get(&rpc->aer_fifo, &e_src))
        aer_isr_one_error(rpc, &e_src);
    return IRQ_HANDLED;
}

static void aer_isr_one_error(struct aer_rpc *rpc,
                              struct aer_err_source *e_src)
{
    struct pci_dev *pdev = rpc->rpd;
    struct aer_err_info e_info;

    pci_rootport_aer_stats_incr(pdev, e_src);

    /*
     * There is a possibility that both correctable error and
     * uncorrectable error being logged. Report correctable error first.
     */
    if (e_src->status & PCI_ERR_ROOT_COR_RCV) {
        e_info.id = ERR_COR_ID(e_src->id);
        e_info.severity = AER_CORRECTABLE;

        if (e_src->status & PCI_ERR_ROOT_MULTI_COR_RCV)
            e_info.multi_error_valid = 1;
        else
            e_info.multi_error_valid = 0;
        aer_print_port_info(pdev, &e_info);

        if (find_source_device(pdev, &e_info))
            aer_process_err_devices(&e_info);
    }

    if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
        e_info.id = ERR_UNCOR_ID(e_src->id);

        if (e_src->status & PCI_ERR_ROOT_FATAL_RCV)
            e_info.severity = AER_FATAL;
        else
            e_info.severity = AER_NONFATAL;

        if (e_src->status & PCI_ERR_ROOT_MULTI_UNCOR_RCV)
            e_info.multi_error_valid = 1;
        else
            e_info.multi_error_valid = 0;

        aer_print_port_info(pdev, &e_info);

        if (find_source_device(pdev, &e_info))
            aer_process_err_devices(&e_info);
    }
}
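To make the `ERR_COR_ID()` / `ERR_UNCOR_ID()` step more concrete, the sketch below splits the Error Source Identification value into its two 16-bit Requester IDs and decodes one into bus, device, and function numbers. The two macros mirror the definitions in `drivers/pci/pcie/aer.c`; `struct bdf` and `bdf_from_reqid()` are hypothetical helpers added for illustration, and the decoding assumes the conventional 8/5/3-bit layout without ARI.

```c
#include <stdint.h>

/* Low 16 bits: correctable source; high 16 bits: uncorrectable source. */
#define ERR_COR_ID(d)   ((d) & 0xffff)
#define ERR_UNCOR_ID(d) ((d) >> 16)

/* Hypothetical decoded Requester ID. */
struct bdf {
    uint8_t bus;
    uint8_t dev;
    uint8_t fn;
};

/* Splits a 16-bit Requester ID into bus[15:8], device[7:3], function[2:0]. */
static struct bdf bdf_from_reqid(uint16_t id)
{
    struct bdf b;

    b.bus = (uint8_t)(id >> 8);
    b.dev = (uint8_t)((id >> 3) & 0x1f);
    b.fn = (uint8_t)(id & 0x7);
    return b;
}
```

For a register value of `0x01000308`, the correctable source is `0x0308` (bus 3, device 1, function 0) and the uncorrectable source is `0x0100` (bus 1, device 0, function 0).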
DPC (Downstream Port Containment)
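When a Downstream Port detects an unmasked uncorrectable error, DPC halts traffic below the port to keep potentially corrupted data from spreading, and it supports CER (Containment Error Recovery). A DPC trigger is not itself treated as an error, but it can be signaled as a correctable error. When DPC is triggered, the port can raise an interrupt or send an ERR_COR message. The flow is as follows:
DPC trigger -> DPC interrupt & DPC trigger status/reason set -> LTSSM disabled
SW clears DPC status -> LTSSM moves to the Detect state -> link retrain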
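The two-line flow above can be pictured as a very small state machine. The model below is a simplified sketch under stated assumptions: `enum ltssm_state` collapses the real LTSSM into three states, and `struct dpc_port` together with the three functions are hypothetical names, not kernel or hardware interfaces.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified view of the link state. */
enum ltssm_state { LTSSM_L0, LTSSM_DISABLED, LTSSM_DETECT };

/* Hypothetical model of a DPC-capable downstream port. */
struct dpc_port {
    bool trigger_status; /* DPC Trigger Status */
    bool irq_pending;    /* DPC Interrupt Status */
    enum ltssm_state ltssm;
};

/* DPC triggers: record status, optionally raise the interrupt, stop the link. */
static void dpc_trigger(struct dpc_port *p, bool irq_enabled)
{
    p->trigger_status = true;
    p->irq_pending = irq_enabled;
    p->ltssm = LTSSM_DISABLED;
}

/* Software clears the trigger status, releasing the LTSSM into Detect. */
static void dpc_sw_clear(struct dpc_port *p)
{
    if (!p->trigger_status)
        return;
    p->trigger_status = false;
    p->ltssm = LTSSM_DETECT;
}

/* Link training completes only once the port has left containment. */
static void link_retrain_done(struct dpc_port *p)
{
    if (p->ltssm == LTSSM_DETECT)
        p->ltssm = LTSSM_L0;
}
```

The point of the model is that the link cannot come back up while the trigger status is still set: `link_retrain_done()` has no effect until `dpc_sw_clear()` has run.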
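DPC Trigger Enable
- 00b: disabled (default)
- 01b: enabled; triggers on ERR_FATAL
- 10b: enabled; triggers on ERR_NONFATAL or ERR_FATAL
DPC Completion Control: 0 = CA, 1 = UR
DPC Interrupt: raises INTx/MSI/MSI-X when DPC is triggered
DPC ERR_COR: sends an ERR_COR message when DPC is triggered, independently of the interrupt
Poisoned TLP Egress Blocking: a poisoned TLP must not be transmitted. If DPC is not triggered, a UR Cpl is returned; otherwise TLPs are no longer accepted and are discarded while DPC is active
Software Trigger: writing 1 triggers DPC; reads always return 0. It can be used to disable the link, and it takes priority over MSI/MSI-X
DL_Active ERR_COR: when the data link layer transitions to DL_Active, the DSP sends ERR_COR; this is not handled as an error
DPC RP Busy: tells software that the RP must remain in the DPC containment state
AER Uncorrectable Error Status is updated according to DPC Trigger Enable and DPC Trigger Status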
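The DPC Trigger Enable encoding can be checked with a few lines of C. In this sketch the two enable values match `PCI_EXP_DPC_CTL_EN_FATAL` and `PCI_EXP_DPC_CTL_EN_NONFATAL` from Linux `include/uapi/linux/pci_regs.h`; `DPC_TRIGGER_EN_MASK`, `enum dpc_msg`, and `dpc_triggers_on()` are hypothetical names. Treating the 11b encoding as "never triggers" is an assumption made for the example, and triggers from errors detected by the port itself are not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* DPC Control bits, as in Linux include/uapi/linux/pci_regs.h. */
#define PCI_EXP_DPC_CTL_EN_FATAL    0x0001 /* 01b: trigger on ERR_FATAL */
#define PCI_EXP_DPC_CTL_EN_NONFATAL 0x0002 /* 10b: trigger on ERR_NONFATAL/ERR_FATAL */

/* Hypothetical mask for the two-bit DPC Trigger Enable field. */
#define DPC_TRIGGER_EN_MASK 0x0003

/* Hypothetical type for a received error message. */
enum dpc_msg { DPC_MSG_ERR_NONFATAL, DPC_MSG_ERR_FATAL };

/* Returns true if a received message of this type triggers DPC. */
static bool dpc_triggers_on(uint16_t dpc_ctl, enum dpc_msg msg)
{
    switch (dpc_ctl & DPC_TRIGGER_EN_MASK) {
    case PCI_EXP_DPC_CTL_EN_FATAL:
        return msg == DPC_MSG_ERR_FATAL;
    case PCI_EXP_DPC_CTL_EN_NONFATAL:
        return true; /* both ERR_NONFATAL and ERR_FATAL trigger */
    default:
        return false; /* 00b: disabled; 11b: assumed not to trigger */
    }
}
```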
Root Port Programmed I/O (RP PIO) Error Controls (eDPC)
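These controls provide fine-grained handling of the errors that the RP encounters on non-posted (NP) requests (UR/CA/CTO for CFG, I/O, and Mem requests); keeping them in sync with the AER configuration is recommended. If the severity bit is set, the error is handled as a UCE and triggers DPC.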
static irqreturn_t dpc_irq(int irq, void *context)
{
    struct pci_dev *pdev = context;
    u16 cap = pdev->dpc_cap, status;

    /* Not ours if Interrupt Status is clear or the config read failed. */
    pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);
    if (!(status & PCI_EXP_DPC_STATUS_INTERRUPT) || PCI_POSSIBLE_ERROR(status))
        return IRQ_NONE;

    /* Acknowledge the interrupt by writing the status bit back. */
    pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
                          PCI_EXP_DPC_STATUS_INTERRUPT);

    /* Hand off to the threaded handler only if DPC has actually triggered. */
    if (status & PCI_EXP_DPC_STATUS_TRIGGER)
        return IRQ_WAKE_THREAD;
    return IRQ_HANDLED;
}
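Once the handler has confirmed that DPC triggered, the trigger reason can be read from the same DPC Status value. The sketch below decodes it; the bit masks match Linux `include/uapi/linux/pci_regs.h`, whereas `enum dpc_reason` and `dpc_decode_reason()` are hypothetical names used only here. It assumes the usual encoding of the reason field (0: unmasked uncorrectable error, 1: ERR_NONFATAL, 2: ERR_FATAL, 3: see the extension field, where 0 is an RP PIO error and 1 is a software trigger).

```c
#include <stdint.h>

/* DPC Status bits, as in Linux include/uapi/linux/pci_regs.h. */
#define PCI_EXP_DPC_STATUS_TRIGGER         0x0001 /* Trigger Status */
#define PCI_EXP_DPC_STATUS_TRIGGER_RSN     0x0006 /* Trigger Reason */
#define PCI_EXP_DPC_STATUS_INTERRUPT       0x0008 /* Interrupt Status */
#define PCI_EXP_DPC_RP_BUSY                0x0010 /* Root Port Busy */
#define PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT 0x0060 /* Trigger Reason Extension */

/* Hypothetical decoded trigger reason. */
enum dpc_reason {
    DPC_NOT_TRIGGERED,
    DPC_UNMASKED_UNCOR,
    DPC_ERR_NONFATAL,
    DPC_ERR_FATAL,
    DPC_RP_PIO,
    DPC_SW_TRIGGER,
    DPC_RESERVED,
};

/* Decodes the trigger reason from a raw DPC Status value. */
static enum dpc_reason dpc_decode_reason(uint16_t status)
{
    unsigned int reason, ext;

    if (!(status & PCI_EXP_DPC_STATUS_TRIGGER))
        return DPC_NOT_TRIGGERED;

    reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1;
    ext = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT) >> 5;

    switch (reason) {
    case 0: return DPC_UNMASKED_UNCOR;
    case 1: return DPC_ERR_NONFATAL;
    case 2: return DPC_ERR_FATAL;
    default: /* reason 3: look at the extension field */
        if (ext == 0)
            return DPC_RP_PIO;
        if (ext == 1)
            return DPC_SW_TRIGGER;
        return DPC_RESERVED;
    }
}
```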
Linux Reference:
drivers/pci/pcie/aer.c
drivers/pci/pcie/dpc.c