linux源码解读（十五）：红黑树在内核的应用——CFS调度器

batsom · 2022-10-09 14:11:31

1、在现代的操作系统中，进程调度是最核心的功能之一；linux 0.11的调度算法简单粗暴：遍历task_struct数组，找到时间片counter最大的进程执行；显然这种策略已经不适合越来越复杂的业务场景需求了，所以后来逐步增加了多种调度策略，目前最广为人知的调度策略有5种：cfs、idle、deadline、realtime、stop，并且这5种调度策略都是同时存在的，不排除后续增加新的调度策略，怎么才能更方便地统一管理存量和增量的调度策略了？从 2.6.23开始引入了sched_class，如下：

struct sched_class {
    const struct sched_class *next;
    /*
    1、全是成员函数：这里用函数指针来表达；
    2、调度的方式有很多种，比如cfs、rt、idle、deadline，每种方式的实现方法肯定不同，这里提供接口函数让不同的调度方式各自去实现（类似驱动的struct file_operations *ops结构体）
    */
    void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);/*任务加入队列；cfs就是在红黑树插入节点*/
    void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);/*任务移除队列；cfs就是在红黑树删除节点*/
    void (*yield_task) (struct rq *rq);/*让出任务*/
    bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt);/*让出到任务*/

    void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);

    /*
     * It is the responsibility of the pick_next_task() method that will
     * return the next task to call put_prev_task() on the @prev task or
     * something equivalent.
     *
     * May return RETRY_TASK when it finds a higher prio class has runnable
     * tasks.
     */
    struct task_struct * (*pick_next_task) (struct rq *rq,
                        struct task_struct *prev,
                        struct pin_cookie cookie);
    void (*put_prev_task) (struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP
    int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
    void (*migrate_task_rq)(struct task_struct *p);

    void (*task_woken) (struct rq *this_rq, struct task_struct *task);

    void (*set_cpus_allowed)(struct task_struct *p,
                 const struct cpumask *newmask);

    void (*rq_online)(struct rq *rq);
    void (*rq_offline)(struct rq *rq);
#endif

    void (*set_curr_task) (struct rq *rq);
    void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
    void (*task_fork) (struct task_struct *p);
    void (*task_dead) (struct task_struct *p);

    /*
     * The switched_from() call is allowed to drop rq->lock, therefore we
     * cannot assume the switched_from/switched_to pair is serliazed by
     * rq->lock. They are however serialized by p->pi_lock.
     */
    void (*switched_from) (struct rq *this_rq, struct task_struct *task);
    void (*switched_to) (struct rq *this_rq, struct task_struct *task);
    void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
                 int oldprio);

    unsigned int (*get_rr_interval) (struct rq *rq,
                     struct task_struct *task);

    void (*update_curr) (struct rq *rq);

#define TASK_SET_GROUP  0
#define TASK_MOVE_GROUP    1

#ifdef CONFIG_FAIR_GROUP_SCHED
    void (*task_change_group) (struct task_struct *p, int type);
#endif
};

这个class把调度中涉及到的方法全部抽象出来定义成函数指针，不同的调度算法对于函数的实现肯定不一样，linux内核直接调用这些函数指针就能达到使用不同调度策略的目的了，是不是很巧妙了？和设备驱动的file_operations结构体思路是一样的（函数指针的接口分别由各个厂家的驱动实现，但是接口名称保持一致）！不同调度策略/实例的关系和代码文件如下：
FluxBB bbcode 测试

2、介绍CFS之前，先总结一下linux调度的类型和背景：

（1）基于时间片轮询，又称O(n)调度：每次调度都需要遍历所有的task_struct，找到时间片最大的执行；如果进程很多，导致task_struct很长，每次光是遍历就很耗时，时间复杂度是O(n)；n是task_struct的个数；除此以外，还有比较明显的缺陷：

SMP系统扩展不好，访问run queue需要加锁
实时进程不能立即调度
cpu可能空转
进程在多个cpu之间来回跳转，降低性能

（2）上面的调度很耗时，核心因素就是每次都要遍历所有的task_struct去寻找时间片最大的进程，时间复杂度被抬高到了O(n)，并且也没有优先级的功能，这两点该怎么改进了？O(1)算法由此诞生，简单来说，先把所有的任务按照不同的优先级加入不同的队列，然后先调度优先级高的队列，由此专门诞生了prio_array结构体来支撑算法，如下：

#define MAX_USER_RT_PRIO    100
#define MAX_RT_PRIO         MAX_USER_RT_PRIO
#define MAX_PRIO            (MAX_RT_PRIO + 40)//140个优先级（0 ~ 139，数值越小优先级越高）

#define BITMAP_SIZE ((((MAX_PRIO+1+7)/8)+sizeof(long)-1)/sizeof(long))

struct prio_array {
    int nr_active;//所有优先级队列中的总任务数。
    unsigned long bitmap[BITMAP_SIZE];//每个位对应一个优先级的任务队列，用于记录哪个任务队列不为空，能通过 bitmap 够快速找到不为空的任务队列
    struct list_head queue[MAX_PRIO];//优先级队列数组，每个元素维护一个优先级队列，比如索引为0的元素维护着优先级为0的任务队列
};

图示如下：先扫描bitmap，找到不为空的队列去调度（比如这里的2、6号队列不为空）；由于bitmap的大小是固定的，所以遍历的时间也是固定的，时间复杂度自然是O(1)了；因为数值越低、优先级越高，所以从bitmap的0开始遍历，找到第一个不为空的队列就可以停止遍历了，这里又节约了时间，所以整体的效率比简单粗暴的时间片轮询高多了！总结一下：O(1)调度算法的本质就是把大量的任务按照优先级分队列，从优先级高的队列开始执行，避免了时间片轮询那种“眉毛胡子一把抓”的混乱，是一种典型的空间换时间的思路！

FluxBB bbcode 测试

相比时间片轮询，O(1)算法确实做了比较大的改进，但是自身也不是100%完美无瑕（否则就不会后后续其他的调度算法了），比如：

交互性较强的任务要再次运行，就需要等待当前等待队列中的所有任务都执行完成：比如进程需要用户输入时阻塞，但并不是用户输入后马上唤醒，而是同队列其他任务都运行完后才继续执行，可能导致交互不及时，产生卡顿的感觉，影响用户体验
不能保证在给定的时间间隔内，为每个任务分配的时间与其优先级是成正比的；这个问题是上面问题引申出来的：比如A进程优先级高，分配了20ms，B进程分配10ms，但是A被阻塞了，cpu转而执行B；A要等到B执行完成后才会继续，所以A优先级高完全没体现出来，名存实亡！

为了解决上面的问题，CFS诞生了！

3、（1）不管用哪种调度方式，首先要找到进程的task_struct；由于上层业务应用需求多种多样，操作系统肯定会不停的创建、运行和销毁进程，导致进程的状态时刻都在变化，进程的权重/优先级肯定也要不停地变化，这么多的关键因素都在改变，怎么高效、快速地管理这些不断变化的进程了？以往的调度策略用的最多的就是链表了，根据不同的进程状态、优先级等影响因素加入不同的队列，但是链表有个致命弱点：只能顺序遍历，导致增删改查效率极低。基于链表这种数据结构，又发明了红黑树，本质是把原来链表“平铺直叙”式的顺序排列改成了按照大小的树形排列，此时再增删改查的效率就要高很多了！那么问题又来了：既然红黑树需要按照节点某个值的大小排序，选哪个值比较适合了？linux开发人员选择的是vruntime！计算公式如下：

vruntime = vruntime + 实际运行时间（time process run) * 1024 / 进程权重(load weight of this process)

注意：vruntime是累加的！实际运行时间就是进程运行时暂用cpu的时间，权重该怎么计算了？这里有个映射表，根据nice值查找对应的weight！nice值类似于优先级，取值为下面所示的从15到-20，每次递减5；根据nice值找到weight后就可以带入公式计算vruntime了！

/*
 * Nice levels are multiplicative, with a gentle 10% change for every
 * nice level changed. I.e. when a CPU-bound task goes from nice 0 to
 * nice 1, it will get ~10% less CPU time than another CPU-bound task
 * that remained on nice 0.
 *
 * The "10% effect" is relative and cumulative: from _any_ nice level,
 * if you go up 1 level, it's -10% CPU usage, if you go down 1 level
 * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
 * If a task goes up by ~10% and another task goes down by ~10% then
 * the relative distance between them is ~25%.)
 */
const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

vruntime的值越小，说明占用cpu的时间就越少，或者说权重越大，这时就需要优先运行了！所以用红黑树是根据所有进程的vruntime来组织的，树最左下角的节点就是vruntime最小的节点，是需要最先执行的节点；随着进程的执行，或者说权重的调整，vruntime是不停在变化的，此时就需要动态调整红黑树了。由于红黑树本身的算法特点，动态调整肯定比链表快多了，这是CFS选择红黑树的根本原因！看到这里，CFS算法的特点之一就明显了：没有时间片的概念，而是根据实际的运行时间和虚拟运行时间来对任务进行排序，从而选择调度；

（2）算法原理介绍完，接着该看看linux内核是怎么实现的了！和其他模块一样，CFS的实现少不了结构体的支持，算法相关的核心结构体如下：

第一个肯定是task_struct了！新增了不同调度器的描述符，便于确定本进程使用了哪些调度策略！sched_entity就是构造红黑树的关键成员变量了！

struct task_struct {
    .......
        int prio, static_prio, normal_prio;
    unsigned int rt_priority;
    const struct sched_class *sched_class;/*调度策略的实例*/
    struct sched_entity se;/*cfs调度策略，包含了rb_node*/
    struct sched_rt_entity rt;/*real time调度策略*/
#ifdef CONFIG_CGROUP_SCHED
    struct task_group *sched_task_group;
#endif
    struct sched_dl_entity dl;/*deadline 调度*/
    .......
}

组成红黑树的关键结构体：有个run_node字段，从名字就能看出是正在运行的进程节点！

struct sched_entity {/*cfs调度策略*/
    struct load_weight    load;        /* for load-balancing */
    struct rb_node        run_node;   /*调度实体是由红黑树组织起来的*/ 
    struct list_head    group_node;
    unsigned int        on_rq;
    /*构造红黑树时，其实下面的每一项都可以用作节点的key；
    1、但是这里选vruntime作为key构造红黑树，换句话说用vruntime来排序，小的靠左，大的靠右
    2、如果不同进程的vruntime一样，可以加很小的数改成不一样的
    */
    u64            exec_start;
    u64            sum_exec_runtime;
    u64            vruntime;/*红黑树节点排序的变量*/
    u64            prev_sum_exec_runtime;

    u64            nr_migrations;

#ifdef CONFIG_SCHEDSTATS
    struct sched_statistics statistics;
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
    int            depth;
    struct sched_entity    *parent;
    /* rq on which this entity is (to be) queued: */
    struct cfs_rq        *cfs_rq;
    /* rq "owned" by this entity/group: */
    struct cfs_rq        *my_q;
#endif

#ifdef CONFIG_SMP
    /*
     * Per entity load average tracking.
     *
     * Put into separate cache line so it does not
     * collide with read-mostly values above.
     */
    struct sched_avg    avg ____cacheline_aligned_in_smp;
#endif
};

还有直接描述cfs正在runquene的结构体：包含红黑树的根节点、最左边的节点（也就是vruntime最小的节点）、当前正在使用的调度结构体；

/* CFS-related fields in a runqueue */
struct cfs_rq {
......
    struct rb_root tasks_timeline;/*红黑树的root根节点*/
    struct rb_node *rb_leftmost;/*红黑树最左边的节点，也就是vruntime最小的节点*/

    /*
     * 'curr' points to currently running entity on this cfs_rq.
     * It is set to NULL otherwise (i.e when none are currently running).
     */
    struct sched_entity *curr, *next, *last, *skip;
......

}

上面各种结构体种类繁多，不容易理清关系，看看下面的图就清晰了：
FluxBB bbcode 测试

结构体准备好后，就可以通过各种api建树了！

（3）既然红黑树排序以vruntime为准，这个值肯定是要不断调整的，具体的更改函数在update_curr函数(kernel\sched\fair.c)，如下：关键代码处加了中文注释

/*
 * Update the current task's runtime statistics.
 */
static void update_curr(struct cfs_rq *cfs_rq)
{
    struct sched_entity *curr = cfs_rq->curr;
    u64 now = rq_clock_task(rq_of(cfs_rq));
    u64 delta_exec;

    if (unlikely(!curr))
        return;
    /*计算进程已经执行的时间*/
    delta_exec = now - curr->exec_start;
    if (unlikely((s64)delta_exec <= 0))
        return;

    curr->exec_start = now;//更新开始执行的时间

    schedstat_set(curr->statistics.exec_max,
              max(delta_exec, curr->statistics.exec_max));

    curr->sum_exec_runtime += delta_exec;
    schedstat_add(cfs_rq->exec_clock, delta_exec);

    curr->vruntime += calc_delta_fair(delta_exec, curr);//更新vruntime
    update_min_vruntime(cfs_rq);

    if (entity_is_task(curr)) {
        struct task_struct *curtask = task_of(curr);

        trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
        cpuacct_charge(curtask, delta_exec);
        account_group_exec_runtime(curtask, delta_exec);
    }

    account_cfs_rq_runtime(cfs_rq, delta_exec);
}

把节点加入红黑树：

/*
 * Enqueue an entity into the rb-tree:
 */
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;/*红黑树根节点*/
    struct rb_node *parent = NULL;
    struct sched_entity *entry;
    int leftmost = 1;

    /*
     * Find the right place in the rbtree:
     */
    while (*link) {
        parent = *link;
        /*找到节点实例的首地址，就是container_of的宏定义*/
        entry = rb_entry(parent, struct sched_entity, run_node);
        /*
         * We dont care about collisions. Nodes with
         * the same key stay together.
         */
        if (entity_before(se, entry)) {
            link = &parent->rb_left;
        } else {
            link = &parent->rb_right;
            leftmost = 0;
        }
    }

    /*
     * Maintain a cache of leftmost tree entries (it is frequently
     * used):
     */
    if (leftmost)
        cfs_rq->rb_leftmost = &se->run_node;
    /*在红黑树中插入节点，即将node节点插入到parent节点的左树或者右树，整个过程会动态调整树结构保持平衡；*/
    rb_link_node(&se->run_node, parent, link);
    /*设置节点的颜色*/
    rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
}

和上面的作用刚好相反：删除节点

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    if (cfs_rq->rb_leftmost == &se->run_node) {
        struct rb_node *next_node;

        next_node = rb_next(&se->run_node);
        cfs_rq->rb_leftmost = next_node;
    }

    rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
}

（4）红黑树建好后，最最最最重要的功能就是找出需要调度的进程了，如下：

/*
 * Pick the next process, keeping these things in mind, in this order:
 * 1) keep things fair between processes/task groups
 * 2) pick the "next" process, since someone really wants that to run
 * 3) pick the "last" process, for cache locality
 * 4) do not run the "skip" process, if something else is available
 */
static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
    struct sched_entity *left = __pick_first_entity(cfs_rq);//树上最左边的节点
    struct sched_entity *se;

    /*
     * If curr is set we have to see if its left of the leftmost entity
     * still in the tree, provided there was anything in the tree at all.
       若 second 为空, 或者 curr 的 vruntime 更小 
     */
    if (!left || (curr && entity_before(curr, left)))
        left = curr;

    se = left; /* ideally we run the leftmost entity */

    /*
     * Avoid running the skip buddy, if running something else can
     * be done without getting too unfair.
     */
    if (cfs_rq->skip == se) {
        struct sched_entity *second;

        if (se == curr) {
            second = __pick_first_entity(cfs_rq);/*返回最左边、也就是vruntime最小的节点*/
        } else {
            second = __pick_next_entity(se);/*找到比se节点大的第一个节点*/
            if (!second || (curr && entity_before(curr, second)))
                second = curr;
        }
        /*判断是否应该抢占当前进程*/
        if (second && wakeup_preempt_entity(second, left) < 1)
            se = second;
    }

    /*
     * Prefer last buddy, try to return the CPU to a preempted task.
     */
    if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
        se = cfs_rq->last;

    /*
     * Someone really wants this to run. If it's not unfair, run it.
     */
    if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
        se = cfs_rq->next;

    clear_buddies(cfs_rq, se);

    return se;
}

这里啰嗦几句：调度器有多个，都实现了pick_next_entity的方法！
FluxBB bbcode 测试

总结：

1、链表这种“憨憨”类的数据结构，能少用就尽量少用；尽量用红黑树替代吧，增删改查的效率高多了！

2、既然用红黑树选择最小的vruntime节点运行，用小根堆是不是也能达到同样的效果了？

参考：

1、https://jishuin.proginn.com/p/763bfbd5f8d6 linux进程调度知识点

2、https://mp.weixin.qq.com/s?__biz=MzA3NzYzODg1OA==&mid=2648464309&idx=1&sn=9fc763d9233fbba6d40b69b1ef54aa8b&chksm=87660610b0118f060a4da0c64417e57e8cb35f4732043106fd1b1d9a3ad6134145e0e47f5a9b&scene=21#wechat_redirect O(1)调度算法

3、https://blog.csdn.net/longwang155069/article/details/104457109 linux O(1)调度器

4、http://www.wowotech.net/process_management/scheduler-history.html O(1)、O(n)和CFS调度器

5、https://zhuanlan.zhihu.com/p/372441187 操作系统调度算法CFS

Gentoo中文社区

公告

#1 2022-10-09 14:11:31

linux源码解读（十五）：红黑树在内核的应用——CFS调度器

页脚