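1. Implementation of SMS in GCC

1.1. Basic concepts

(1) Software pipelining is a technique that improves the scheduling of instructions in a loop by overlapping instructions from different iterations so that they execute in parallel. The key idea is to find a pattern of operations (called the kernel code) that, when iterated repeatedly, has the effect of starting the next iteration before the previous one has completed. (Figure: the result of software-pipelining a loop that contains 4 instructions.)

(2) Modulo scheduling is one method of implementing software pipelining. It focuses on minimizing the average cycle count per loop iteration, thereby improving performance.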

(3) This article describes the implementation of SMS (Swing Modulo Scheduling) in GCC. SMS is a modulo scheduling technique that also aims to reduce register pressure.
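To make the prologue / kernel / epilogue idea concrete, here is a minimal hand-written sketch in C. The function names and the three-stage split (load, multiply, store) are assumptions chosen for illustration; this is not code that GCC generates.

```c
#include <assert.h>
#include <stddef.h>

/* Reference loop: each iteration runs load -> multiply -> store. */
void scale_plain(const int *src, int *dst, size_t n, int k)
{
    for (size_t i = 0; i < n; i++) {
        int loaded = src[i];     /* stage 0 */
        int scaled = loaded * k; /* stage 1 */
        dst[i] = scaled;         /* stage 2 */
    }
}

/* Hand-pipelined variant: the kernel mixes three different iterations. */
void scale_pipelined(const int *src, int *dst, size_t n, int k)
{
    assert(src != NULL && dst != NULL);
    if (n < 2) {                 /* too short to fill the pipeline */
        scale_plain(src, dst, n, k);
        return;
    }

    /* Prologue: start iterations 0 and 1. */
    int loaded = src[0];
    int scaled = loaded * k;
    loaded = src[1];

    /* Kernel: store of i, multiply of i + 1, load of i + 2. */
    size_t i;
    for (i = 0; i + 2 < n; i++) {
        dst[i] = scaled;
        scaled = loaded * k;
        loaded = src[i + 2];
    }

    /* Epilogue: drain iterations n - 2 and n - 1. */
    dst[i] = scaled;
    scaled = loaded * k;
    dst[i + 1] = scaled;
}
```

Both functions produce the same output; the pipelined one simply starts the load of a later iteration before the store of an earlier one has been issued.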

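1.2. A loop handled by SMS must satisfy the following constraints:

(1) The iteration count of the loop is known (otherwise, code is generated that tests the iteration count at run time and, depending on the result, chooses whether to execute the software-pipelined version).

(2) The loop body is a single basic block (BB).

1.3. SMS processes a loop in the following steps:

1.3.1. Build the data dependence graph (DDG). The nodes of the DDG represent RTL instructions, and its edges represent intra-loop and inter-loop (cross-iteration) dependences.

1.3.2. The modulo scheduler processes the loop in the following steps:

(1) Compute the minimum initiation interval (MII).

(2) Compute the scheduling order of the nodes in the DDG (swinging).

(3) Schedule the kernel.

(4) Perform modulo variable expansion.

(5) Generate the prologue (fill part) and the epilogue (drain part).

(6) Generate the pre-check on the loop iteration count (optional).

2. A detailed walk through 1.3.2 with reference to the GCC source code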

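GCC uses the unroll-and-compact approach to software pipelining. It unrolls the loop body several times and then looks, among the instructions, for a repeating pattern that is initiated at every step. If such a pattern is found, it builds the pipelined loop body from the pattern and builds the prologue and epilogue from the instructions before and after it. Figures 17.23-17.25 in the Whale Book (Muchnick, Advanced Compiler Design and Implementation) give an example. This is an unconstrained greedy schedule: it assumes that all instructions listed for a cycle can execute in parallel, that is, it ignores the number of registers and the processor's instruction-level parallelism.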

   

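2.1. Computing the minimum initiation interval (MII)

Minimum initiation interval (MII): a lower bound on the number of cycles needed to execute the loop kernel under any feasible schedule. Informally, it is the fewest cycles in which the processor could execute one pass of the kernel, given its issue width and other hardware resources and the dependences in the loop.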

mii = 1; /* Need to pass some estimate of mii.  */
rec_mii = sms_order_nodes (g, mii, node_order, &max_asap);
mii = MAX (res_MII (g), rec_mii);
maxii = MAX (max_asap, MAXII_FACTOR * mii);

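A schedule is feasible if it satisfies all inter-instruction dependence constraints together with their latencies, and avoids all potential resource conflicts. We therefore compute two separate bounds:

(1) Recurrence bound recMII: based on the dependence cycles (recurrences) in the loop.

(2) Resource bound resMII: based on the availability of resources and the hardware resources that the instructions require.

The purpose of computing MII is to avoid trying values of II that are too small, which speeds up modulo scheduling at compile time. As a lower bound, MII does not affect the correctness of the resulting schedule.

Likewise, MaxII is computed as an upper bound to limit the search range for II and reduce compile time.

The values of II that are tried therefore lie in the interval [MII, MaxII).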
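As a rough illustration of how the two bounds combine, the following sketch computes them for hand-supplied numbers. It assumes that issue slots are the only modelled resource and that a recurrence is summarised by its total latency and total distance; the function names and the maxii_factor parameter are hypothetical and only mirror the shape of the snippet above.

```c
#include <assert.h>

/* Ceiling division for a >= 0, b > 0. */
static int ceil_div(int a, int b)
{
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

static int max_int(int a, int b) { return a > b ? a : b; }

/* Resource bound, assuming issue slots are the only modelled resource. */
int res_mii_sketch(int num_insns, int issue_rate)
{
    return max_int(1, ceil_div(num_insns, issue_rate));
}

/* Recurrence bound of one dependence cycle: latency sum over distance sum. */
int rec_mii_sketch(int latency_sum, int distance_sum)
{
    return ceil_div(latency_sum, distance_sum);
}

/* Fills [*mii, *maxii), the half-open range of II values to try. */
void ii_range_sketch(int res_mii, int rec_mii, int max_asap, int maxii_factor,
                     int *mii, int *maxii)
{
    *mii = max_int(res_mii, rec_mii);
    *maxii = max_int(max_asap, maxii_factor * *mii);
}
```

For example, 7 instructions on a 2-issue machine give a resource bound of 4, and a recurrence with total latency 5 and total distance 2 gives a recurrence bound of 3, so the search would start at II = 4.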

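2.2. Computing the scheduling order of the nodes in the DDG (swinging order)

The purpose of the "swinging" order is that, once an instruction has been placed, the scheduler immediately tries to place its predecessors or successors in the DDG and keeps them as close to it as possible. This shortens the live ranges of virtual registers and thereby reduces register pressure.

2.2.1 Source analysis of sms_order_nodes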

/* Order the nodes of G for scheduling and pass the result in
   NODE_ORDER.  Also set aux.count of each node to ASAP.
   Put maximal ASAP to PMAX_ASAP.  Return the recMII for the given DDG.  */
static int
sms_order_nodes (ddg_ptr g, int mii, int * node_order, int *pmax_asap)
{
  int i;
  int rec_mii = 0;
  ddg_all_sccs_ptr sccs = create_ddg_all_sccs (g);

  nopa nops = calculate_order_params (g, mii, pmax_asap);

  if (dump_file)
    print_sccs (dump_file, sccs, g);

  order_nodes_of_sccs (sccs, node_order);

  if (sccs->num_sccs > 0)
    /* First SCC has the largest recurrence_length.  */
    rec_mii = sccs->sccs[0]->recurrence_length;

  /* Save ASAP before destroying node_order_params.  */
  for (i = 0; i < g->num_nodes; i++)
    {
      ddg_node_ptr v = &g->nodes[i];
      v->aux.count = ASAP (v);
    }

  free (nops);
  free_ddg_all_sccs (sccs);
  check_nodes_order (node_order, g->num_nodes);

  return rec_mii;
}

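The function sms_order_nodes is called when computing rec_mii. It does three things:

(1) create_ddg_all_sccs: finds the SCCs from the back edges (backarcs) of the DDG and sorts them with qsort in decreasing order of recurrence_length. The largest recurrence_length among the SCCs is then used as rec_mii (sccs->sccs[0]->recurrence_length).

(2) calculate_order_params: computes, for every node in the DDG, its ASAP (earliest scheduling time), ALAP (latest scheduling time) and HEIGHT (the length of the longest path from the node to a sink node, i.e. a node with no successors). The macros below define the five node-ordering parameters on top of the ORDER_PARAMS accessor. The other two are MOB (mobility: the slack within which the node can be scheduled) and DEPTH (the length of the longest path from a source node, i.e. a node with no predecessors, to the node).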

#define ORDER_PARAMS(x) ((struct node_order_params *) (x)->aux.info)
#define ASAP(x) (ORDER_PARAMS ((x))->asap)
#define ALAP(x) (ORDER_PARAMS ((x))->alap)
#define HEIGHT(x) (ORDER_PARAMS ((x))->height)
#define MOB(x) (ALAP ((x)) - ASAP ((x)))
#define DEPTH(x) (ASAP ((x)))

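(3) order_nodes_of_sccs: starting from the strongly connected components, orders all nodes for scheduling using the parameters computed in (2). The ordering algorithm is described in detail in 2.2.2.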
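The ordering parameters can be illustrated with a small standalone sketch. It assumes an acyclic graph whose nodes are already numbered in topological order and whose edges all have distance 0; the graph, the struct names and compute_params are hypothetical and are not the GCC implementation of calculate_order_params.

```c
#include <assert.h>

#define N_NODES 4
#define N_EDGES 4

struct edge { int src, dest, latency; };
struct params { int asap, alap, height; };

/* Hypothetical acyclic DDG; edges are sorted by src and src < dest. */
static const struct edge edges[N_EDGES] = {
    {0, 1, 2}, {0, 2, 1}, {1, 3, 1}, {2, 3, 1},
};

static int max_int(int a, int b) { return a > b ? a : b; }
static int min_int(int a, int b) { return a < b ? a : b; }

void compute_params(struct params p[N_NODES])
{
    int max_asap = 0;

    for (int v = 0; v < N_NODES; v++)
        p[v].asap = p[v].height = 0;

    /* Forward pass: ASAP is the longest latency path from any source. */
    for (int e = 0; e < N_EDGES; e++) {
        assert(edges[e].src < edges[e].dest);
        p[edges[e].dest].asap = max_int(p[edges[e].dest].asap,
                                        p[edges[e].src].asap + edges[e].latency);
        max_asap = max_int(max_asap, p[edges[e].dest].asap);
    }

    for (int v = 0; v < N_NODES; v++)
        p[v].alap = max_asap;

    /* Backward pass: ALAP and HEIGHT are measured from the sinks. */
    for (int e = N_EDGES - 1; e >= 0; e--) {
        p[edges[e].src].alap = min_int(p[edges[e].src].alap,
                                       p[edges[e].dest].alap - edges[e].latency);
        p[edges[e].src].height = max_int(p[edges[e].src].height,
                                         p[edges[e].dest].height + edges[e].latency);
    }
}

/* Mobility: the slack between the latest and earliest start. */
int mobility(const struct params *p) { return p->alap - p->asap; }
```

In this graph node 2 lies off the critical path 0 -> 1 -> 3, so it is the only node with non-zero mobility.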

2.2.2 Source analysis of order_nodes_of_sccs

static void
order_nodes_of_sccs (ddg_all_sccs_ptr all_sccs, int * node_order)
{
  int i, pos = 0;
  ddg_ptr g = all_sccs->ddg;
  int num_nodes = g->num_nodes;
  auto_sbitmap prev_sccs (num_nodes);
  auto_sbitmap on_path (num_nodes);
  auto_sbitmap tmp (num_nodes);
  auto_sbitmap ones (num_nodes);

  bitmap_clear (prev_sccs);
  bitmap_ones (ones);

  /* Perform the node ordering starting from the SCC with the highest recMII.
     For each SCC order the nodes according to their ASAP/ALAP/HEIGHT etc.  */
  for (i = 0; i < all_sccs->num_sccs; i++)
    {
      ddg_scc_ptr scc = all_sccs->sccs[i];

      /* Note: find_nodes_on_paths searches for directed paths from
         prev_sccs to scc->nodes.  Besides the nodes on those paths, the
         result on_path also keeps the nodes of prev_sccs and
         scc->nodes.  */
      /* Add nodes on paths from previous SCCs to the current SCC.  */
      find_nodes_on_paths (on_path, g, prev_sccs, scc->nodes);
      bitmap_ior (tmp, scc->nodes, on_path);

      /* Add nodes on paths from the current SCC to previous SCCs.  */
      find_nodes_on_paths (on_path, g, scc->nodes, prev_sccs);
      bitmap_ior (tmp, tmp, on_path);

      /* Remove nodes of previous SCCs from current extended SCC.  */
      bitmap_and_compl (tmp, tmp, prev_sccs);

      pos = order_nodes_in_scc (g, prev_sccs, tmp, node_order, pos);
      /* Above call to order_nodes_in_scc updated prev_sccs |= tmp.  */
    }

  /* Handle the remaining nodes that do not belong to any scc.  Each call
     to order_nodes_in_scc handles a single connected component.  */
  while (pos < g->num_nodes)
    {
      bitmap_and_compl (tmp, ones, prev_sccs);
      pos = order_nodes_in_scc (g, prev_sccs, tmp, node_order, pos);
    }
}

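The algorithm has two steps. In the first step, we build a partial order of the nodes by partitioning the DDG into subsets S1, S2, ... as follows:

(1) Find the SCC (Strongly Connected Component)/Recurrence of the data-dependence graph having the largest recMII; this is the first set of nodes S1.
(2) Find the SCC with the next largest recMII, put its nodes into the next set S2.
(3) Find all nodes that are on directed paths from any previous set (the variable prev_sccs in the source) to the next set S2 (the variable tmp in the source) and add them to the next set S2.
(4) If there are additional SCCs in the dependence graph goto step 2. If there are no additional SCCs, create a new (last) set of all the remaining nodes.

In the second step, the helper function order_nodes_in_scc processes each subset. It works in one of two ordering modes:

(1) BOTTOMUP: the predecessors of already-ordered nodes are ordered first. The order is decided by find_max_dv_min_mob: the first priority is maximum DEPTH, the second is minimum MOB.

(2) TOPDOWN: the successors of already-ordered nodes are ordered first. The order is decided by find_max_hv_min_mob: the first priority is maximum HEIGHT, the second is minimum MOB.

Each time a node is ordered, its successors (in TOPDOWN mode) or its predecessors (in BOTTOMUP mode) are added to the workset and take part in the ordering that follows.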
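The selection rule used in TOPDOWN mode (maximum HEIGHT first, minimum MOB as tie-breaker) can be sketched as follows. This is an illustrative stand-in that uses plain arrays instead of sbitmaps; the function name and parameters are assumptions and do not reproduce find_max_hv_min_mob itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns the index of the workset node with the largest height, breaking
   ties by the smallest mobility, or -1 if the workset is empty. */
int pick_max_height_min_mob(const bool *workset, const int *height,
                            const int *mob, int num_nodes)
{
    assert(num_nodes >= 0);
    int best = -1;

    for (int v = 0; v < num_nodes; v++) {
        if (!workset[v])
            continue;
        if (best < 0
            || height[v] > height[best]
            || (height[v] == height[best] && mob[v] < mob[best]))
            best = v;
    }
    return best;
}
```

The BOTTOMUP rule would have the same shape with DEPTH in place of HEIGHT.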

/* Places the nodes of SCC into the NODE_ORDER array starting
   at position POS, according to the SMS ordering algorithm.
   NODES_ORDERED (in&out parameter) holds the bitset of all nodes in
   the NODE_ORDER array, starting from position zero.  */
static int
order_nodes_in_scc (ddg_ptr g, sbitmap nodes_ordered, sbitmap scc,
                    int * node_order, int pos)
{
  enum sms_direction dir;
  int num_nodes = g->num_nodes;
  auto_sbitmap workset (num_nodes);
  auto_sbitmap tmp (num_nodes);
  sbitmap zero_bitmap = sbitmap_alloc (num_nodes);
  auto_sbitmap predecessors (num_nodes);
  auto_sbitmap successors (num_nodes);

  bitmap_clear (predecessors);
  find_predecessors (predecessors, g, nodes_ordered);

  bitmap_clear (successors);
  find_successors (successors, g, nodes_ordered);

  bitmap_clear (tmp);
  if (bitmap_and (tmp, predecessors, scc))
    {
      bitmap_copy (workset, tmp);
      dir = BOTTOMUP;
    }
  else if (bitmap_and (tmp, successors, scc))
    {
      bitmap_copy (workset, tmp);
      dir = TOPDOWN;
    }
  else
    {
      int u;

      bitmap_clear (workset);
      if ((u = find_max_asap (g, scc)) >= 0)
        bitmap_set_bit (workset, u);
      dir = BOTTOMUP;
    }

  bitmap_clear (zero_bitmap);
  while (!bitmap_equal_p (workset, zero_bitmap))
    {
      int v;
      ddg_node_ptr v_node;
      sbitmap v_node_preds;
      sbitmap v_node_succs;

      if (dir == TOPDOWN)
        {
          while (!bitmap_equal_p (workset, zero_bitmap))
            {
              v = find_max_hv_min_mob (g, workset);
              v_node = &g->nodes[v];
              node_order[pos++] = v;
              v_node_succs = NODE_SUCCESSORS (v_node);
              bitmap_and (tmp, v_node_succs, scc);

              /* Don't consider the already ordered successors again.  */
              bitmap_and_compl (tmp, tmp, nodes_ordered);
              bitmap_ior (workset, workset, tmp);
              bitmap_clear_bit (workset, v);
              bitmap_set_bit (nodes_ordered, v);
            }
          dir = BOTTOMUP;
          bitmap_clear (predecessors);
          find_predecessors (predecessors, g, nodes_ordered);
          bitmap_and (workset, predecessors, scc);
        }
      else
        {
          while (!bitmap_equal_p (workset, zero_bitmap))
            {
              v = find_max_dv_min_mob (g, workset);
              v_node = &g->nodes[v];
              node_order[pos++] = v;
              v_node_preds = NODE_PREDECESSORS (v_node);
              bitmap_and (tmp, v_node_preds, scc);

              /* Don't consider the already ordered predecessors again.  */
              bitmap_and_compl (tmp, tmp, nodes_ordered);
              bitmap_ior (workset, workset, tmp);
              bitmap_clear_bit (workset, v);
              bitmap_set_bit (nodes_ordered, v);
            }
          dir = TOPDOWN;
          bitmap_clear (successors);
          find_successors (successors, g, nodes_ordered);
          bitmap_and (workset, successors, scc);
        }
    }
  sbitmap_free (zero_bitmap);
  return pos;
}

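2.3. Scheduling the kernel

The II computation in 2.1 gave us MII and MaxII, and the node ordering in 2.2 gave us the array node_order, which holds the scheduling order of the nodes. In this section these results are passed to sms_schedule_by_order, which schedules the nodes of the loop kernel in the precomputed order.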

/* This function implements the scheduling algorithm for SMS according to the
   above algorithm.  */
static partial_schedule_ptr
sms_schedule_by_order (ddg_ptr g, int mii, int maxii, int *nodes_order)
{
  int ii = mii;
  int i, c, success, num_splits = 0;
  int flush_and_start_over = true;
  int num_nodes = g->num_nodes;
  int start, end, step; /* Place together into one struct?  */
  auto_sbitmap sched_nodes (num_nodes);
  auto_sbitmap tobe_scheduled (num_nodes);

  partial_schedule_ptr ps = create_partial_schedule (ii, g, DFA_HISTORY);

  bitmap_ones (tobe_scheduled);
  bitmap_clear (sched_nodes);

  while (flush_and_start_over && (ii < maxii))
    {

      if (dump_file)
        fprintf (dump_file, "Starting with ii=%d\n", ii);
      flush_and_start_over = false;
      bitmap_clear (sched_nodes);

      for (i = 0; i < num_nodes; i++)
        {
          int u = nodes_order[i];
          ddg_node_ptr u_node = &ps->g->nodes[u];
          rtx_insn *insn = u_node->insn;
          ......
          if (bitmap_bit_p (sched_nodes, u))
            continue;

          /* Try to get non-empty scheduling window.  */
          success = 0;
          if (get_sched_window (ps, u_node, sched_nodes, ii, &start,
                                &step, &end) == 0)
            {
              ......
              for (c = start; c != end; c += step)
                {
                  /* The handling of precede and follow is omitted here.  */
                  ......
                  success =
                    try_scheduling_node_in_cycle (ps, u, c,
                                                  sched_nodes,
                                                  &num_splits, tmp_precede,
                                                  tmp_follow);
                  if (success)
                    break;
                }

              verify_partial_schedule (ps, sched_nodes);
            }
          if (!success)
            {
              if (ii++ == maxii)
                break;
              /* The compute_split_row path is omitted here; only the
                 reset_partial_schedule path is kept.  */
              ......
              flush_and_start_over = true;
              verify_partial_schedule (ps, sched_nodes);
              reset_partial_schedule (ps, ii);
              verify_partial_schedule (ps, sched_nodes);
              break;
            }

          /* ??? If (success), check register pressure estimates.  */
        } /* Continue with next node.  */
    } /* While flush_and_start_over.  */
  if (ii >= maxii)
    {
      free_partial_schedule (ps);
      ps = NULL;
    }
  else
    gcc_assert (bitmap_equal_p (tobe_scheduled, sched_nodes));

  return ps;
}

(1) get_sched_window: for each node in node_order we compute a scheduling window, that is, a range of cycles in which we can schedule the node according to already scheduled nodes. Previously scheduled predecessors (PSP) increase the lower bound of the scheduling window, while previously scheduled successors (PSS) decrease the upper bound of the scheduling window. The cycles within the scheduling window are not bounded a priori, and can be positive or negative. The scheduling window itself contains a range of at most II cycles.

/* Given the partial schedule PS, this function calculates and returns the
   cycles in which we can schedule the node with the given index I.
   NOTE: Here we do the backtracking in SMS, in some special cases.  We have
   noticed that there are several cases in which we fail to SMS the loop
   because the sched window of a node is empty due to tight data-deps.  In
   such cases we want to unschedule some of the predecessors/successors
   until we get non-empty scheduling window.  It returns -1 if the
   scheduling window is empty and zero otherwise.  */
static int
get_sched_window (partial_schedule_ptr ps, ddg_node_ptr u_node,
                  sbitmap sched_nodes, int ii, int *start_p, int *step_p,
                  int *end_p)
{
  int start, step, end;
  int early_start, late_start;

  /* We first compute a forward range (start <= end), then decide whether
     to reverse it.  */
  early_start = INT_MIN;
  late_start = INT_MAX;
  start = INT_MIN;
  end = INT_MAX;
  step = 1;

  /* The code that handles PSP and PSS is omitted here.  */
  ......

  /* Get a target scheduling window no bigger than ii.  */
  early_start = NODE_ASAP (u_node);
  late_start = MIN (late_start, early_start + (ii - 1));

  /* Apply memory dependence limits.  */
  start = MAX (start, early_start);
  end = MIN (end, late_start);

  /* Now that we've finalized the window, make END an exclusive rather
     than an inclusive bound.  */
  end += step;

  *start_p = start;
  *step_p = step;
  *end_p = end;

  if (start >= end && step == 1)
    {
      if (dump_file)
        fprintf (dump_file, "\nEmpty window: start=%d, end=%d, step=%d\n",
                 start, end, step);
      return -1;
    }

  return 0;
}
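Because the PSP/PSS handling is elided in the listing above, the sketch below shows one common way such a window can be derived, following the textual description of get_sched_window rather than the GCC source. The bound formulas (schedule time plus or minus latency, shifted by distance times II), the struct layout and the function name are assumptions, and the sketch always returns a forward range instead of also choosing a direction.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* One already scheduled neighbour of the node being placed. */
struct dep { int sched_time, latency, distance; };

/* Inclusive cycle range; the window is empty when start > end. */
struct window { int start, end; };

static int max_int(int a, int b) { return a > b ? a : b; }
static int min_int(int a, int b) { return a < b ? a : b; }

struct window sched_window_sketch(const struct dep *preds, int n_preds,
                                  const struct dep *succs, int n_succs,
                                  int asap, int ii)
{
    assert(ii > 0);
    int early = INT_MIN, late = INT_MAX;

    /* Scheduled predecessors raise the lower bound. */
    for (int i = 0; i < n_preds; i++)
        early = max_int(early, preds[i].sched_time + preds[i].latency
                               - preds[i].distance * ii);

    /* Scheduled successors lower the upper bound. */
    for (int i = 0; i < n_succs; i++)
        late = min_int(late, succs[i].sched_time - succs[i].latency
                             + succs[i].distance * ii);

    /* Without predecessors, anchor the window at ASAP or at the successors. */
    if (n_preds == 0)
        early = (n_succs == 0) ? asap : late - (ii - 1);

    /* The window never spans more than II cycles. */
    struct window w = { early, min_int(late, early + (ii - 1)) };
    return w;
}
```

With a predecessor at cycle 0 (latency 2), a successor at cycle 10 (latency 1) and II = 4, the dependences alone would allow cycles 2 to 9, and the II cap trims that to 2 to 5.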

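(2) try_scheduling_node_in_cycle: once the scheduling window is known, we try to place the node in some cycle of the window while avoiding resource conflicts. On success, the node is marked as scheduled and its (absolute) schedule time is recorded. If the node cannot be scheduled anywhere in its window, II is increased, the partial schedule is reset, and scheduling starts over from the first node. If II reaches MaxII, SMS gives up on the current loop.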

/* Return 1 if U_NODE can be scheduled in CYCLE.  Use the following
   parameters to decide if that's possible:
   PS - The partial schedule.
   U - The serial number of U_NODE.
   NUM_SPLITS - The number of row splits made so far.
   MUST_PRECEDE - The nodes that must precede U_NODE. (only valid at
   the first row of the scheduling window)
   MUST_FOLLOW - The nodes that must follow U_NODE. (only valid at the
   last row of the scheduling window)  */
static bool
try_scheduling_node_in_cycle (partial_schedule_ptr ps,
                              int u, int cycle, sbitmap sched_nodes,
                              int *num_splits, sbitmap must_precede,
                              sbitmap must_follow)
{
  ps_insn_ptr psi;
  bool success = 0;

  verify_partial_schedule (ps, sched_nodes);
  psi = ps_add_node_check_conflicts (ps, u, cycle, must_precede, must_follow);
  if (psi)
    {
      SCHED_TIME (u) = cycle;
      bitmap_set_bit (sched_nodes, u);
      success = 1;
      *num_splits = 0;
      if (dump_file)
        fprintf (dump_file, "Scheduled w/o split in %d\n", cycle);
    }

  return success;
}

/* Checks if the given node causes resource conflicts when added to PS at
   cycle C.  If not the node is added to PS and returned; otherwise zero
   is returned.  Bit N is set in MUST_PRECEDE/MUST_FOLLOW if the node with
   cuid N must be come before/after (respectively) the node pointed to by
   PS_I when scheduled in the same cycle.  */
ps_insn_ptr
ps_add_node_check_conflicts (partial_schedule_ptr ps, int n,
                             int c, sbitmap must_precede,
                             sbitmap must_follow)
{
  int has_conflicts = 0;
  ps_insn_ptr ps_i;

  /* First add the node to the PS, if this succeeds check for
     conflicts, trying different issue slots in the same row.  */
  if (! (ps_i = add_node_to_ps (ps, n, c, must_precede, must_follow)))
    return NULL; /* Failed to insert the node at the given cycle.  */

  /* The handling around ps_has_conflicts is omitted here.  */
  ......

  ps->min_cycle = MIN (ps->min_cycle, c);
  ps->max_cycle = MAX (ps->max_cycle, c);
  return ps_i;
}

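(3) add_node_to_ps: while scheduling the kernel we maintain a partial schedule (ps for short), which stores the scheduled instructions in II rows. When an instruction is scheduled at cycle T (within its scheduling window), it is inserted by ps_insn_find_column into one row of ps. The row is computed as:

row = T mod II

If the number of instructions in that row is already greater than or equal to issue_rate, the row has reached the limit of instruction-level parallelism and no further instruction can be added to it. Once all instructions have been scheduled successfully, ps gives the order of the instructions in the kernel.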
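Since cycles inside a scheduling window may be negative, the row has to be computed with a modulo that never returns a negative value. The sketch below shows that mapping together with the issue_rate check; struct ps_sketch, MAX_II and both function names are illustrative assumptions, not GCC definitions.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_II 16

/* Minimal stand-in for a partial schedule: only row occupancy is tracked. */
struct ps_sketch {
    int ii;
    int issue_rate;
    int rows_length[MAX_II];
};

/* Modulo that always yields a row in [0, ii), even for negative cycles. */
int row_of_cycle(int cycle, int ii)
{
    assert(ii > 0);
    int r = cycle % ii;
    return r < 0 ? r + ii : r;
}

/* Tries to place one instruction at CYCLE; fails when the row is full. */
bool ps_try_add(struct ps_sketch *ps, int cycle)
{
    assert(ps->ii > 0 && ps->ii <= MAX_II);
    int row = row_of_cycle(cycle, ps->ii);

    if (ps->rows_length[row] >= ps->issue_rate)
        return false;
    ps->rows_length[row]++;
    return true;
}
```

With II = 3, cycles 1, 4 and -2 all map to row 1, so on a 2-issue machine the third of them is rejected.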

/* Inserts a DDG_NODE to the given partial schedule at the given cycle.
   Returns 0 if this is not possible and a PS_INSN otherwise.  Bit N is
   set in MUST_PRECEDE/MUST_FOLLOW if the node with cuid N must be come
   before/after (respectively) the node pointed to by PS_I when scheduled
   in the same cycle.  */
static ps_insn_ptr
add_node_to_ps (partial_schedule_ptr ps, int id, int cycle,
                sbitmap must_precede, sbitmap must_follow)
{
  ps_insn_ptr ps_i;
  int row = SMODULO (cycle, ps->ii);

  if (ps->rows_length[row] >= issue_rate)
    return NULL;

  ps_i = create_ps_insn (id, cycle);

  /* Finds and inserts PS_I according to MUST_FOLLOW and
     MUST_PRECEDE.  */
  if (! ps_insn_find_column (ps, ps_i, must_precede, must_follow))
    {
      free (ps_i);
      return NULL;
    }

  ps->rows_length[row] += 1;
  return ps_i;
}

/* Unlike what literature describes for modulo scheduling (which focuses
   on VLIW machines) the order of the instructions inside a cycle is
   important.  Given the bitmaps MUST_FOLLOW and MUST_PRECEDE we know
   where the current instruction should go relative to the already
   scheduled instructions in the given cycle.  Go over these
   instructions and find the first possible column to put it in.  */
static bool
ps_insn_find_column (partial_schedule_ptr ps, ps_insn_ptr ps_i,
                     sbitmap must_precede, sbitmap must_follow)
{
  ps_insn_ptr next_ps_i;
  ps_insn_ptr first_must_follow = NULL;
  ps_insn_ptr last_must_precede = NULL;
  ps_insn_ptr last_in_row = NULL;
  int row;

  if (! ps_i)
    return false;

  row = SMODULO (ps_i->cycle, ps->ii);

  /* Find the first must follow and the last must precede
     and insert the node immediately after the must precede
     but make sure that it there is no must follow after it.  */
  for (next_ps_i = ps->rows[row];
       next_ps_i;
       next_ps_i = next_ps_i->next_in_row)
    {
      /* The handling of precede and follow is omitted here.  */
      ......
      last_in_row = next_ps_i;
    }

  /* The closing branch is scheduled as well.  Make sure there is no
     dependent instruction after it as the branch should be the last
     instruction in the row.  */
  if (JUMP_P (ps_rtl_insn (ps, ps_i->id)))
    {
      if (last_in_row)
        {
          /* Make the branch the last in the row.  New instructions
             will be inserted at the beginning of the row or after the
             last must_precede instruction thus the branch is guaranteed
             to remain the last instruction in the row.  */
          last_in_row->next_in_row = ps_i;
          ps_i->prev_in_row = last_in_row;
          ps_i->next_in_row = NULL;
        }
      else
        ps->rows[row] = ps_i;
      return true;
    }

  /* Now insert the node after INSERT_AFTER_PSI.  */
  ps_i->next_in_row = ps->rows[row];
  ps_i->prev_in_row = NULL;
  if (ps_i->next_in_row)
    ps_i->next_in_row->prev_in_row = ps_i;
  ps->rows[row] = ps_i;

  return true;
}
 
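2.4. Performing modulo variable expansion

After all instructions have been scheduled into the kernel, some values that are defined (def) in one iteration and used (use) in a later iteration must be preserved so that they are not overwritten. This happens when the live range of a register exceeds II cycles: the def instruction executes more than once before the value is used. The problem is solved by modulo variable expansion, which is implemented by generating register copy instructions (schedule_reg_moves) as follows (some platforms provide this support in hardware through rotating registers):

(1) Compute the number of copies needed for a given register that is defined at cycle T_def and used at cycle T_use, according to the following equation:

$$\text{copies} = \left\lfloor \frac{T_{use} - T_{def}}{II} \right\rfloor + \text{adjustment}$$

where adjustment = -1 if the use appears before the def in the same row of ps, and zero otherwise. The total number of copies needed for a given register def is given by its last use.

(2) Generate the copy instructions in reverse order, preceding the def, and then attach each use to the corresponding definition (the original def or one of its copies).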
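As a worked instance of step (1), the sketch below evaluates the copy count for a single def/use pair. It assumes a distance-0 dependence with T_use >= T_def; the function name and the boolean flag are hypothetical, and the extra II term that the GCC listing in 2.4.1 adds for distance-1 edges is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Number of register copies for one def/use pair of a distance-0 dependence.
   use_precedes_def_in_row is only meaningful when both share a kernel row. */
int reg_copies_sketch(int t_def, int t_use, int ii,
                      bool use_precedes_def_in_row)
{
    assert(ii > 0 && t_use >= t_def);
    assert(!use_precedes_def_in_row || (t_use - t_def) % ii == 0);

    int copies = (t_use - t_def) / ii;
    if (use_precedes_def_in_row)
        copies--;               /* the use reads before the next def writes */
    return copies;
}
```

For a def at cycle 2 and a use at cycle 9 with II = 3, the live range spans two full initiation intervals, so two copies are needed; the count for the whole register is the maximum over all of its uses.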
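
2.4.1 Source of schedule_reg_moves

This function creates the register copies and records the relation between each copy and its uses. The actual replacement of registers in the instructions is done by apply_reg_moves.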
 
/*
   Breaking intra-loop register anti-dependences:
   Each intra-loop register anti-dependence implies a cross-iteration true
   dependence of distance 1.  Therefore, we can remove such false dependencies
   and figure out if the partial schedule broke them by checking if (for a
   true-dependence of distance 1): SCHED_TIME (def) < SCHED_TIME (use) and
   if so generate a register move.  The number of such moves is equal to:
                SCHED_TIME (use) - SCHED_TIME (def)       { 0 broken
   nreg_moves = ----------------------------------- + 1 - {   dependence.
                              ii                          { 1 if not.
*/
static bool
schedule_reg_moves (partial_schedule_ptr ps)
{
  ddg_ptr g = ps->g;
  int ii = ps->ii;
  int i;

  for (i = 0; i < g->num_nodes; i++)
    {
      ddg_node_ptr u = &g->nodes[i];
      ddg_edge_ptr e;
      int nreg_moves = 0, i_reg_move;
      rtx prev_reg, old_reg;
      int first_move;
      int distances[2];
      sbitmap distance1_uses;
      rtx set = single_set (u->insn);

      /* Skip instructions that do not set a register.  */
      if ((set && !REG_P (SET_DEST (set))))
        continue;

      /* Compute the number of reg_moves needed for u, by looking at life
         ranges started at u (excluding self-loops).  */
      distances[0] = distances[1] = false;
      for (e = u->out; e; e = e->next_out)
        if (e->type == TRUE_DEP && e->dest != e->src)
          {
            int nreg_moves4e = (SCHED_TIME (e->dest->cuid)
                                - SCHED_TIME (e->src->cuid)) / ii;

            if (e->distance == 1)
              nreg_moves4e = (SCHED_TIME (e->dest->cuid)
                              - SCHED_TIME (e->src->cuid) + ii) / ii;

            /* If dest precedes src in the schedule of the kernel, then dest
               will read before src writes and we can save one reg_copy.  */
            if (SCHED_ROW (e->dest->cuid) == SCHED_ROW (e->src->cuid)
                && SCHED_COLUMN (e->dest->cuid) < SCHED_COLUMN (e->src->cuid))
              nreg_moves4e--;

            ......

            if (nreg_moves4e)
              {
                gcc_assert (e->distance < 2);
                distances[e->distance] = true;
              }
            nreg_moves = MAX (nreg_moves, nreg_moves4e);
          }

      if (nreg_moves == 0)
        continue;

      /* Create NREG_MOVES register moves.  */
      first_move = ps->reg_moves.length ();
      ps->reg_moves.safe_grow_cleared (first_move + nreg_moves);
      extend_node_sched_params (ps);

      /* Record the moves associated with this node.  */
      first_move += ps->g->num_nodes;

      /* Generate each move.  */
      old_reg = prev_reg = SET_DEST (single_set (u->insn));
      for (i_reg_move = 0; i_reg_move < nreg_moves; i_reg_move++)
        {
          ps_reg_move_info *move = ps_reg_move (ps, first_move + i_reg_move);

          move->def = i_reg_move > 0 ? first_move + i_reg_move - 1 : i;
          move->uses = sbitmap_alloc (first_move + nreg_moves);
          move->old_reg = old_reg;
          move->new_reg = gen_reg_rtx (GET_MODE (prev_reg));
          move->num_consecutive_stages = distances[0] && distances[1] ? 2 : 1;
          move->insn = gen_move_insn (move->new_reg, copy_rtx (prev_reg));
          bitmap_clear (move->uses);

          prev_reg = move->new_reg;
        }

      distance1_uses = distances[1] ? sbitmap_alloc (g->num_nodes) : NULL;

      if (distance1_uses)
        bitmap_clear (distance1_uses);

      /* Every use of the register defined by node may require a different
         copy of this register, depending on the time the use is scheduled.
         Record which uses require which move results.  */
      for (e = u->out; e; e = e->next_out)
        if (e->type == TRUE_DEP && e->dest != e->src)
          {
            int dest_copy = (SCHED_TIME (e->dest->cuid)
                             - SCHED_TIME (e->src->cuid)) / ii;

            if (e->distance == 1)
              dest_copy = (SCHED_TIME (e->dest->cuid)
                           - SCHED_TIME (e->src->cuid) + ii) / ii;

            if (SCHED_ROW (e->dest->cuid) == SCHED_ROW (e->src->cuid)
                && SCHED_COLUMN (e->dest->cuid) < SCHED_COLUMN (e->src->cuid))
              dest_copy--;

            if (dest_copy)
              {
                ps_reg_move_info *move;

                move = ps_reg_move (ps, first_move + dest_copy - 1);
                bitmap_set_bit (move->uses, e->dest->cuid);
                if (e->distance == 1)
                  bitmap_set_bit (distance1_uses, e->dest->cuid);
              }
          }

      auto_sbitmap must_follow (first_move + nreg_moves);
      for (i_reg_move = 0; i_reg_move < nreg_moves; i_reg_move++)
        if (!schedule_reg_move (ps, first_move + i_reg_move,
                                distance1_uses, must_follow))
          break;
      if (distance1_uses)
        sbitmap_free (distance1_uses);
      if (i_reg_move < nreg_moves)
        return false;
    }
  return true;
}
 
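2.5. Generating the prologue (fill part) and the epilogue (drain part)

The kernel of a modulo-scheduled loop contains instances of instructions from different iterations. A prologue and an epilogue are therefore needed (unless all operations are speculative) to keep the code correct. When they are generated, special handling is required if the loop bound is unknown. One approach is to add an exit branch to each iteration of the prologue, each targeting a different epilogue; this is complicated and increases code size. The other approach is to keep an original copy of the loop, which is executed if the loop count is too small to reach the kernel; otherwise a branch-free prologue is executed, followed by the kernel and the epilogue. GCC implements the latter, because it is simpler and has a smaller impact on code size.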

/* Generate the instructions (including reg_moves) for prolog & epilog.  */
static void
generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
                        rtx count_reg, rtx count_init)
{
  int i;
  int last_stage = PS_STAGE_COUNT (ps) - 1;
  edge e;

  /* Generate the prolog, inserting its insns on the loop-entry edge.  */
  start_sequence ();

  if (!count_init)
    {
      /* Generate instructions at the beginning of the prolog to
         adjust the loop count by STAGE_COUNT.  If loop count is constant
         (count_init), this constant is adjusted by STAGE_COUNT in
         generate_prolog_epilog function.  */
      rtx sub_reg = NULL_RTX;

      sub_reg = expand_simple_binop (GET_MODE (count_reg), MINUS, count_reg,
                                     gen_int_mode (last_stage, GET_MODE (count_reg)),
                                     count_reg, 1, OPTAB_DIRECT);
      gcc_assert (REG_P (sub_reg));
      if (REGNO (sub_reg) != REGNO (count_reg))
        emit_move_insn (count_reg, sub_reg);
    }

  for (i = 0; i < last_stage; i++)
    duplicate_insns_of_cycles (ps, 0, i, count_reg);

  /* Put the prolog on the entry edge.  */
  e = loop_preheader_edge (loop);
  split_edge_and_insert (e, get_insns ());
  if (!flag_resched_modulo_sched)
    e->dest->flags |= BB_DISABLE_SCHEDULE;

  end_sequence ();

  /* Generate the epilog, inserting its insns on the loop-exit edge.  */
  start_sequence ();

  for (i = 0; i < last_stage; i++)
    duplicate_insns_of_cycles (ps, i + 1, last_stage, count_reg);

  /* Put the epilogue on the exit edge.  */
  gcc_assert (single_exit (loop));
  e = single_exit (loop);
  split_edge_and_insert (e, get_insns ());
  if (!flag_resched_modulo_sched)
    e->dest->flags |= BB_DISABLE_SCHEDULE;

  end_sequence ();
}
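The stage ranges passed to duplicate_insns_of_cycles above (stages 0..i in the prolog, stages i + 1..last_stage in the epilog, with the loop count reduced by last_stage) can be checked with a small simulation. The sketch below is an illustrative model with assumed names and array bounds, not GCC code: it counts how often each stage of each original iteration is executed.

```c
#include <assert.h>
#include <string.h>

#define MAX_ITERS 32
#define MAX_STAGES 8

/* executed[j][s] counts how often stage s of original iteration j ran. */
void simulate_pipeline(int n_iters, int stage_count,
                       int executed[MAX_ITERS][MAX_STAGES])
{
    int last_stage = stage_count - 1;
    int t = 0;  /* pass number; stage s in pass t belongs to iteration t - s */

    assert(stage_count >= 1 && stage_count <= MAX_STAGES);
    assert(n_iters >= stage_count && n_iters <= MAX_ITERS);
    memset(executed, 0, sizeof(int) * MAX_ITERS * MAX_STAGES);

    /* Prologue: pass i runs stages 0..i. */
    for (int i = 0; i < last_stage; i++, t++)
        for (int s = 0; s <= i; s++)
            executed[t - s][s]++;

    /* Kernel: runs n_iters - last_stage times with all stages. */
    for (int k = 0; k < n_iters - last_stage; k++, t++)
        for (int s = 0; s <= last_stage; s++)
            executed[t - s][s]++;

    /* Epilogue: pass i runs stages i + 1..last_stage. */
    for (int i = 0; i < last_stage; i++, t++)
        for (int s = i + 1; s <= last_stage; s++)
            executed[t - s][s]++;
}
```

Under this model every stage of every original iteration runs exactly once, which is the property the prologue and epilogue exist to preserve.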
 

References:

Swing Modulo Scheduling for GCC