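Linux Task Priorities

These notes cover the concepts behind Linux task priorities (a "task" here means a thread, not a process) and the system calls that user space can use to read and change those priorities.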

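Basic Concepts

Scheduling Policies

The Linux kernel scheduler assigns every task a scheduling policy (loosely also called a scheduling class), and at any moment each task has exactly one. From highest to lowest priority, the policies are: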

SCHED_DEADLINE

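The Deadline policy was introduced in kernel 3.14. It targets tasks that run periodically and must finish within a given time, and it has the highest priority. Deadline tasks are rarely used and are not the focus of these notes.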

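SCHED_FIFO, SCHED_RR

Tasks using either of these two policies are called RT (real-time) tasks. The two policies have the same priority.

Once a SCHED_FIFO task becomes runnable, it immediately preempts any task with a lower priority. Once running, it keeps the CPU until it is preempted by a higher-priority RT task or yields voluntarily; such tasks have no time slice.

SCHED_RR adds a time-slice constraint to SCHED_FIFO: a task runs for at most one time slice and, if it has not finished by then, gives up the CPU and waits for its next turn, hence the name round-robin.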

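SCHED_OTHER, SCHED_BATCH, SCHED_IDLE

Although these are three distinct policies, the scheduler treats them largely alike, and tasks using them are called normal tasks. Most tasks in a system fall into this category. They share CPU time and are scheduled by the kernel's well-known CFS algorithm.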

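Task Priority

The scheduling policy determines a task's first-level priority, and that concept is clear. Priorities among tasks with the same policy, or the same category of policy, are more confusing, because user space and the kernel use different names for them.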

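User-Space View

For normal tasks, user space expresses priority as a nice value in the range [-20, 19]. The larger the nice value, the "nicer" the task is to other tasks competing for the CPU, and the lower its priority.

For RT tasks, user space calls the priority the scheduling priority; a larger value means a higher priority. The valid range is [sched_get_priority_min(2), sched_get_priority_max(2)], which on Linux is always [1, 99].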
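As a small illustration of the range query mentioned above, the sketch below wraps the two library calls in a helper. The name rt_priority_range is hypothetical and exists only for this example; the [1, 99] result is what Linux is expected to report for SCHED_FIFO and SCHED_RR.

```c
#include <sched.h>

/* Hypothetical helper: stores the valid scheduling-priority range for
 * `policy` in *lo and *hi. Returns 0 on success, -1 if the policy is
 * not recognized by the system. */
int rt_priority_range(int policy, int *lo, int *hi)
{
    int min = sched_get_priority_min(policy);
    int max = sched_get_priority_max(policy);

    if (min == -1 || max == -1)
        return -1;
    *lo = min;
    *hi = max;
    return 0;
}
```

Calling rt_priority_range(SCHED_FIFO, &lo, &hi) on Linux should yield lo = 1 and hi = 99, while SCHED_OTHER yields 0 for both, which reflects that normal tasks have no scheduling priority in this sense.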

Kernel View

task_struct defines the following priority-related fields; their meanings are explained below.

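struct task_struct {
...
    int prio, static_prio, normal_prio; /* dynamic, static, normalized priority */
    unsigned int rt_priority;           /* real-time priority (RT tasks) */
    unsigned int policy;                /* scheduling policy (SCHED_*) */
};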

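For a normal task, the kernel converts the user-space nice value into a static priority and stores it in the static_prio field of task_struct. The NICE_TO_PRIO macro below maps nice values linearly from [-20, 19] to [100, 139]; a smaller value still means a higher priority.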

#define MAX_USER_RT_PRIO    100
#define MAX_RT_PRIO        MAX_USER_RT_PRIO

// NICE_WIDTH is 40, the number of nice levels
#define MAX_PRIO        (MAX_RT_PRIO + NICE_WIDTH) // 140, so valid priorities are 0..139
#define DEFAULT_PRIO        (MAX_RT_PRIO + NICE_WIDTH / 2) // 120

/*
 * Convert user-nice values [ -20 ... 0 ... 19 ]
 * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
 * and back.
 */
#define NICE_TO_PRIO(nice)    ((nice) + DEFAULT_PRIO)
#define PRIO_TO_NICE(prio)    ((prio) - DEFAULT_PRIO)

void set_user_nice(struct task_struct *p, long nice)
{
...
    p->static_prio = NICE_TO_PRIO(nice);
}
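To make the mapping concrete, here is a user-space copy of the macros from the listing above, wrapped in two helper functions. The function names are hypothetical and this is only an illustration of the arithmetic, not kernel code.

```c
/* User-space copies of the kernel macros, for illustration only. */
#define MAX_RT_PRIO   100
#define NICE_WIDTH    40
#define DEFAULT_PRIO  (MAX_RT_PRIO + NICE_WIDTH / 2)   /* 120 */

#define NICE_TO_PRIO(nice)  ((nice) + DEFAULT_PRIO)
#define PRIO_TO_NICE(prio)  ((prio) - DEFAULT_PRIO)

/* Hypothetical wrappers so the mapping can be called like a function. */
int nice_to_static_prio(int nice)
{
    return NICE_TO_PRIO(nice);   /* -20 -> 100, 0 -> 120, 19 -> 139 */
}

int static_prio_to_nice(int prio)
{
    return PRIO_TO_NICE(prio);   /* 100 -> -20, 120 -> 0, 139 -> 19 */
}
```

The default nice value 0 lands on 120, the middle of the normal-task range, and the two directions are exact inverses of each other.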

For an RT task, the kernel stores the user-space scheduling priority unchanged in the rt_priority field of task_struct; we can call it the real-time priority.

static void __setscheduler_params(struct task_struct *p,
        const struct sched_attr *attr)
{
...
    p->rt_priority = attr->sched_priority;
}

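Normalized Priority in the Kernel

Tasks thus come with several kinds of scheduling policy and priority. The kernel scheduler normalizes them: it maps them all onto a single linear range and gives them the same direction of monotonicity. The normalized priority is stored in the normal_prio field of task_struct.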

static inline int __normal_prio(struct task_struct *p)
{
    return p->static_prio;
}

static inline int normal_prio(struct task_struct *p)
{
    int prio;

    if (task_has_dl_policy(p))
        prio = MAX_DL_PRIO-1;
    else if (task_has_rt_policy(p))
        prio = MAX_RT_PRIO-1 - p->rt_priority; // MAX_RT_PRIO is 100; this inverts the ordering of the RT scheduling priority
    else
        prio = __normal_prio(p);
    return prio;
}

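The effect of normalization can be summarized as follows:

A Deadline task has a normal_prio of -1, an RT task has a normal_prio in [0, 98] (99 - rt_priority, with rt_priority in [1, 99]), and a normal task has a normal_prio in [100, 139]. The normalized priority can therefore express the priority of every task on one scale, on which a smaller value means a higher priority.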
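The normalization can be replayed in user space with a small model. The sketch below is not kernel code: the enum, the function name model_normal_prio, and the choice to pass the nice value directly are all assumptions made for illustration, and the constants are copied from the listings above.

```c
/* Hypothetical policy categories for this model. */
enum model_policy { MODEL_DEADLINE, MODEL_RT, MODEL_NORMAL };

/* User-space model of normal_prio(). `rt_priority` is used only for RT
 * tasks and `nice` only for normal tasks. */
int model_normal_prio(enum model_policy policy, int rt_priority, int nice)
{
    const int max_dl_prio  = 0;    /* MAX_DL_PRIO */
    const int max_rt_prio  = 100;  /* MAX_RT_PRIO */
    const int default_prio = 120;  /* DEFAULT_PRIO */

    switch (policy) {
    case MODEL_DEADLINE:
        return max_dl_prio - 1;                /* always -1 */
    case MODEL_RT:
        return max_rt_prio - 1 - rt_priority;  /* 99 -> 0, 1 -> 98 */
    default:
        return nice + default_prio;            /* -20 -> 100, 19 -> 139 */
    }
}
```

The model makes the inversion visible: the highest user-space RT priority (99) becomes the smallest normalized value among RT tasks (0), so that "smaller is more important" holds across all three categories.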

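Dynamic Priority in the Kernel

The normalized priority is still not the priority the scheduler finally uses. The scheduler sometimes raises a task's priority temporarily, for example boosting it into the RT range through priority inheritance, so a dynamic priority is introduced as well. It is stored in the prio field of task_struct and is computed by effective_prio().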

/*
 * Calculate the current priority, i.e. the priority
 * taken into account by the scheduler. This value might
 * be boosted by RT tasks, or might be boosted by
 * interactivity modifiers. Will be RT if the task got
 * RT-boosted. If not then it returns p->normal_prio.
 */
static int effective_prio(struct task_struct *p)
{
    p->normal_prio = normal_prio(p);
    /*
     * If we are RT tasks or we were boosted to RT priority,
     * keep the priority unchanged. Otherwise, update priority
     * to the normal priority:
     */
    if (!rt_prio(p->prio))
        return p->normal_prio;
    return p->prio;
}
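The decision made by effective_prio() can be reduced to a two-argument model. This is a sketch under assumptions: the function names are hypothetical, and rt_prio() is modeled simply as "prio is below MAX_RT_PRIO (100)", which may differ in detail between kernel versions.

```c
#include <stdbool.h>

/* Modeled after rt_prio(): priorities below 100 are in the RT range.
 * Assumption: the exact kernel check varies between versions. */
bool model_rt_prio(int prio)
{
    return prio < 100;
}

/* User-space model of effective_prio(): a task whose current dynamic
 * priority is in the RT range (an RT task, or a boosted task) keeps it;
 * any other task falls back to its normalized priority. */
int model_effective_prio(int prio, int normal_prio)
{
    if (!model_rt_prio(prio))
        return normal_prio;
    return prio;
}
```

For example, a normal task with normal_prio 120 that has been boosted to prio 10 keeps 10, whereas an unboosted task with prio 120 whose nice value has just changed to give normal_prio 125 ends up at 125.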

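User-Space Interfaces

As the sched(7) manual page shows, Linux provides the following interfaces for user space to read and adjust task priorities. As described below, each interface has its own scope of applicability.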

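Interface	Description
nice(2)	Adjusts the nice value of the calling thread
getpriority(2), setpriority(2)	Get or set the nice value of a process, a process group, or all processes of a user
sched_setscheduler(2), sched_getscheduler(2)	Get a thread's scheduling policy; set the scheduling policy and, for RT tasks, the scheduling priority
sched_setparam(2), sched_getparam(2)	Variants of sched_setscheduler(2) and sched_getscheduler(2) that handle only the scheduling parameters
sched_setattr(2), sched_getattr(2)	Handle scheduling policy and priority for both RT and normal tasks; they cover everything the interfaces above do and add further extensions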

nice(2)

Adds inc to the nice value of the calling thread, so a positive inc lowers the thread's priority and a negative inc raises it.

int nice(int inc);

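A thread cannot change its nice value arbitrarily: how far the value can be lowered (that is, how far the priority can be raised) is bounded by the RLIMIT_NICE limit described in getrlimit(2). RLIMIT_NICE ranges over [1, 40]; if its soft limit is rlimit_cur, the lowest nice value that can be set is 20 - rlimit_cur.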
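The 20 - rlimit_cur rule can be checked against the running process with getrlimit(2). In the sketch below both function names are hypothetical, and treating an unlimited or oversized soft limit as "-20 is reachable" is an assumption of this example.

```c
#include <sys/resource.h>

/* Pure arithmetic from the rule above: lowest settable nice value. */
int nice_floor_from_rlimit(long rlim_cur)
{
    return (int)(20 - rlim_cur);   /* 40 -> -20, 1 -> 19 */
}

/* Reads the soft RLIMIT_NICE of the calling process and stores the
 * lowest nice value it may set in *floor. Returns 0 on success, -1 on
 * error. A result of 20 (soft limit 0) means the nice value cannot be
 * lowered at all without privilege. */
int current_nice_floor(int *floor)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NICE, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > 40)
        *floor = -20;              /* assumption: clamp to the nice range */
    else
        *floor = nice_floor_from_rlimit((long)rl.rlim_cur);
    return 0;
}
```

Note that the limit only restricts lowering the nice value; raising it, and thereby giving up priority, is not restricted by RLIMIT_NICE.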

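getpriority(2), setpriority(2)

These two interfaces get and set the nice value of tasks. which and who select the tasks to operate on:

which=PRIO_PROCESS, who=PID. The scope is nominally the process; note that on Linux the nice value is a per-thread attribute, so in practice the call affects the thread whose ID is who;
which=PRIO_PGRP, who=process group ID (the PID of the group leader). The scope is all threads in the process group;
which=PRIO_USER, who=UID. The scope is all threads owned by that user;

As you can see, these two interfaces are more powerful and flexible than nice(2).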

int getpriority(int which, id_t who);
int setpriority(int which, id_t who, int prio);

Likewise, the lowest nice value that can be set is bounded by RLIMIT_NICE from getrlimit(2).
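One practical detail when using getpriority(2) is that -1 is a legitimate nice value, so errno has to be cleared before the call and checked afterwards. The sketch below shows this pattern together with a setpriority(2) call that only raises the nice value, which needs no privilege. Both helper names are hypothetical.

```c
#include <errno.h>
#include <sys/resource.h>

/* Reads the nice value of the caller (who = 0) into *out.
 * Returns 0 on success, -1 on error. */
int read_own_nice(int *out)
{
    int value;

    errno = 0;                        /* -1 may be a valid result */
    value = getpriority(PRIO_PROCESS, 0);
    if (value == -1 && errno != 0)
        return -1;
    *out = value;
    return 0;
}

/* Raises the caller's nice value by `delta` (>= 0), capped at 19,
 * which lowers its priority. Returns 0 on success, -1 on error. */
int lower_own_priority(int delta)
{
    int current, target;

    if (delta < 0 || read_own_nice(&current) != 0)
        return -1;
    target = current + delta;
    if (target > 19)
        target = 19;
    return setpriority(PRIO_PROCESS, 0, target);
}
```

Going in the other direction (a lower nice value) is where the RLIMIT_NICE bound applies, and an unprivileged caller would typically see setpriority() fail in that case.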

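sched_setscheduler(2), sched_getscheduler(2)

When sched_setscheduler(2) switches a thread to a normal-task policy, param->sched_priority must be 0, so this interface cannot set the nice value of a normal task. When it switches a thread to an RT policy, param->sched_priority specifies the RT scheduling priority.

sched_getscheduler(2) returns a thread's scheduling policy.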

int sched_setscheduler(pid_t pid, int policy, const struct sched_param *param);
int sched_getscheduler(pid_t pid);

struct sched_param {
   ...
   int sched_priority;
   ...
};

When pid is 0, both interfaces operate on the calling thread.
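The rules for sched_priority can be exercised with a thin wrapper that reports the resulting errno. The helper names set_own_policy and policy_name are hypothetical. The sketch assumes a glibc-style wrapper that forwards directly to the system call; switching to an RT policy normally also requires privilege (or a suitable RLIMIT_RTPRIO), so an unprivileged caller should expect EPERM there.

```c
#include <errno.h>
#include <sched.h>

/* Returns a printable name for the three POSIX policies. */
const char *policy_name(int policy)
{
    switch (policy) {
    case SCHED_OTHER: return "SCHED_OTHER";
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    default:          return "unknown";
    }
}

/* Sets the policy and scheduling priority of the calling thread
 * (pid = 0). Returns 0 on success, otherwise the errno value. */
int set_own_policy(int policy, int priority)
{
    struct sched_param param = { .sched_priority = priority };

    if (sched_setscheduler(0, policy, &param) != 0)
        return errno;
    return 0;
}
```

With this wrapper, set_own_policy(SCHED_OTHER, 0) is the valid form for a normal task, while set_own_policy(SCHED_OTHER, 1) is expected to fail with EINVAL because normal tasks must use priority 0. The current policy can be printed with policy_name(sched_getscheduler(0)).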

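sched_setparam(2), sched_getparam(2)

These interfaces are variants of sched_setscheduler(2) and sched_getscheduler(2): they set and get only the scheduling parameters (sched_priority) and leave the policy unchanged.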

int sched_setparam(pid_t pid, const struct sched_param *param);
int sched_getparam(pid_t pid, struct sched_param *param);
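A minimal sketch of the pair in use, again for the calling thread (pid = 0). The helper names are hypothetical. For a normal task the only accepted value is 0, so the setter is mainly useful for changing the priority of a thread that already runs under SCHED_FIFO or SCHED_RR.

```c
#include <sched.h>

/* Returns the scheduling priority of the calling thread (0 for a
 * normal task, 1..99 for an RT task), or -1 on error. */
int own_sched_priority(void)
{
    struct sched_param param;

    if (sched_getparam(0, &param) != 0)
        return -1;
    return param.sched_priority;
}

/* Changes only the scheduling priority of the calling thread; the
 * policy stays as it is. Returns 0 on success, -1 on error. */
int set_own_sched_priority(int priority)
{
    struct sched_param param = { .sched_priority = priority };

    return sched_setparam(0, &param);
}
```

Writing back the value that was just read, as in set_own_sched_priority(own_sched_priority()), is a harmless round trip that should succeed for any thread.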

sched_setattr(2), sched_getattr(2)

These two interfaces are Linux-specific rather than POSIX. They support adjusting the scheduling policy and priority of every type of task.

int sched_setattr(pid_t pid, struct sched_attr *attr, unsigned int flags);
int sched_getattr(pid_t pid, struct sched_attr *attr, unsigned int size, unsigned int flags);

struct sched_attr {
   u32 size;              /* Size of this structure */
   u32 sched_policy;      /* Policy (SCHED_*) */
   u64 sched_flags;       /* Flags */
   s32 sched_nice;        /* Nice value (SCHED_OTHER, SCHED_BATCH) */
   u32 sched_priority;    /* Static priority (SCHED_FIFO, SCHED_RR) */
   /* Remaining fields are for SCHED_DEADLINE */
   u64 sched_runtime;
   u64 sched_deadline;
   u64 sched_period;
};
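Because older C libraries ship no wrapper for these two calls, a common approach is to invoke them through syscall(2). The sketch below assumes that approach and that SYS_sched_getattr and SYS_sched_setattr are available from <sys/syscall.h>. The struct demo_sched_attr is a local mirror of the layout shown above, deliberately given a different (hypothetical) name so it cannot clash with a libc that already defines struct sched_attr; the helper names are hypothetical as well.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE               /* for the syscall() declaration */
#endif
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local mirror of the first 48 bytes of the kernel's struct sched_attr. */
struct demo_sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;          /* normal tasks */
    uint32_t sched_priority;      /* RT tasks */
    uint64_t sched_runtime;       /* SCHED_DEADLINE only */
    uint64_t sched_deadline;
    uint64_t sched_period;
};

/* Fills *attr with the attributes of the calling thread (pid = 0).
 * Returns 0 on success, -1 on error with errno set. */
int get_own_attr(struct demo_sched_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    return (int)syscall(SYS_sched_getattr, 0, attr,
                        (unsigned int)sizeof(*attr), 0u);
}

/* Sets the nice value of the calling thread through sched_setattr,
 * keeping its current policy. Returns 0 on success, -1 on error. */
int set_own_nice_via_attr(int nice)
{
    struct demo_sched_attr attr;

    if (get_own_attr(&attr) != 0)
        return -1;
    attr.size = sizeof(attr);
    attr.sched_nice = nice;
    return (int)syscall(SYS_sched_setattr, 0, &attr, 0u);
}
```

Unlike sched_setscheduler(2), this path can carry a nice value for a normal task, which is the sense in which the interface covers all task types. Lowering the nice value this way is still subject to the RLIMIT_NICE bound described earlier.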