
MySQL 8.0 plan optimization: source code reading notes


These notes are based on the MySQL 8.0 Community Edition source code.

Prerequisites:

  • MySQL JOIN syntax: https://dev.mysql.com/doc/refman/8.0/en/join.html

  • Straight join: is similar to JOIN, except that the left table is always read before the right table. This can be used for those (few) cases for which the join optimizer processes the tables in a suboptimal order. STRAIGHT_JOIN can be used in two ways. Placed at a JOIN, it acts as a special kind of INNER JOIN that hints the order of that one join. Placed right after SELECT, it forces every JOIN under that SELECT to follow the table order written by the user. Judging from the optimizer code, the second form cannot coexist with semi-join nests: Optimize_table_order::optimize_straight_join asserts DBUG_ASSERT(join->select_lex->sj_nests.is_empty()).

  • join order hint: Join-order hints affect the order in which the optimizer joins tables, including JOIN_FIXED_ORDER, JOIN_ORDER, JOIN_PREFIX, JOIN_SUFFIX

  • JOIN types: INNER JOIN, OUTER JOIN, SEMI JOIN, LEFT/RIGHT JOIN, etc.

  • Materialization: Usually happens for subqueries (which the optimizer may execute as semi-joins). Materialization speeds up query execution by generating a subquery result as a temporary table, normally in memory.

  • Statistics: metadata about a table fetched from the storage engine, such as row count, min/max/sum/avg and key range information, used to assist plan optimization.

  • table dependencies: for A LEFT JOIN B, B depends on A and on A's own dependencies. (To be confirmed: whether DEPEND JOIN semantics are also represented through table dependencies.)

  • table access path: An access path may use either an index scan, a table scan, a range scan or ref access, known as join type in explain.

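    • index scan: usually refers to a scan of a secondary index (the MySQL primary key index stores the row data together with the key).

    • table scan: scans the table directly.

    • range scan: for conditions on indexed columns that can be turned into range lookups, MySQL tries to convert them into a range scan so that rows outside the ranges are not scanned. A range query with a single range behaves like an index scan or table scan with the range condition pushed down; a range query can also extract several ranges from one condition.

    • ref: the join field is indexed, but the index is not the primary key or a UNIQUE NOT NULL index.

    • eq_ref: the join field is indexed and the index is the primary key or a UNIQUE NOT NULL index, which means each record from the preceding tables joins to at most one row of the right table.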

  • Layout of the tables stored in the JOIN object in the MySQL source (read the comment; the variable names alone are ambiguous):

     /**
        Before plan has been created, "tables" denote number of input tables in the
        query block and "primary_tables" is equal to "tables".
        After plan has been created (after JOIN::get_best_combination()),
        the JOIN_TAB objects are enumerated as follows:
        - "tables" gives the total number of allocated JOIN_TAB objects
        - "primary_tables" gives the number of input tables, including
          materialized temporary tables from semi-join operation.
        - "const_tables" are those tables among primary_tables that are detected
          to be constant.
        - "tmp_tables" is 0, 1 or 2 (more if windows) and counts the maximum
          possible number of intermediate tables in post-processing (ie sorting and
          duplicate removal).
          Later, tmp_tables will be adjusted to the correct number of
          intermediate tables, @see JOIN::make_tmp_tables_info.
        - The remaining tables (ie. tables - primary_tables - tmp_tables) are
          input tables to materialized semi-join operations.
        The tables are ordered as follows in the join_tab array:
         1. const primary table
         2. non-const primary tables
         3. intermediate sort/group tables
         4. possible holes in array
         5. semi-joined tables used with materialization strategy
      */
      uint tables;          ///< Total number of tables in query block
      uint primary_tables;  ///< Number of primary input tables in query block
      uint const_tables;    ///< Number of primary tables deemed constant
      uint tmp_tables;      ///< Number of temporary tables used by query

Source code analysis

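The JOIN object represents the join plan of one query and is also passed around as the context of that plan. (For that reason, in optimizations such as parallel query, only the topmost parent query is really worth optimizing; elsewhere the JOIN acts more like a context.)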

  • best_positions holds the final optimized table order.

  • best_read holds the final cost.

  • best_ref holds the input sequence of tables; the optimizer optimizes best_ref.

make_join_plan is called from JOIN::optimize. It computes the best join order and builds the join plan. Steps:

Here is an overview of the logic of this function:

- Initialize JOIN data structures and setup basic dependencies between tables.

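- Update dependencies based on join information. For tables involved in an outer join or in recursion, propagate_dependencies() propagates the relations (using a transitive closure algorithm) to build the complete set of dependencies. (What "recursive" refers to here is not yet settled: nested joins, or the WITH RECURSIVE syntax?)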

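- Make key descriptions (update_ref_and_keys()). This step is fairly involved. Its purpose is to find the join conditions among the conditions and identify the keys (a key here means an index) related to them, in preparation for later deciding whether the join_type is ref, ref_or_null, index and so on. MySQL also adds quite a few special cases in this step, such as special handling of key IS NULL.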

- Pull out semi-join tables based on table dependencies.

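- Extract tables with zero or one row as const tables. The four steps starting here are all const table optimization: the core idea is to identify the const tables first and replace their columns with constants. At this point const tables are detected from the fetched statistics.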

- Read contents of const tables, substitute columns from these tables with
  actual data. Also keep track of empty tables vs. one-row tables.

- After const table extraction based on row count, more tables may
  have become functionally dependent. Extract these as const tables.

- Add new sargable predicates based on retrieved const values.

- Calculate number of rows to be retrieved from each table. This is the step that fetches the statistics.

- Calculate cost of potential semi-join materializations.

- Calculate best possible join order based on available statistics. This is Optimize_table_order::choose_table_order, described below.

- Fill in remaining information for the generated join order.
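
The dependency propagation step in the list above can be pictured as a transitive closure over per-table bitmaps. The following is a minimal sketch under that assumption, not MySQL's propagate_dependencies(): the `TableMap` alias and the `propagate` function are hypothetical names, and the sketch is limited to 64 tables.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a table bitmap: bit i set means "depends on table i".
using TableMap = std::uint64_t;

// Warshall-style transitive closure: if table j depends on table k, then j
// also inherits everything k depends on. Returns false when some table ends
// up depending on itself, i.e. the dependencies contain a cycle.
bool propagate(std::vector<TableMap>& dep) {
  const std::size_t n = dep.size();
  assert(n <= 64);  // one bit per table in this sketch
  for (std::size_t k = 0; k < n; ++k) {
    for (std::size_t j = 0; j < n; ++j) {
      if (dep[j] & (TableMap{1} << k)) dep[j] |= dep[k];
    }
  }
  for (std::size_t i = 0; i < n; ++i) {
    if (dep[i] & (TableMap{1} << i)) return false;
  }
  return true;
}
```

For A LEFT JOIN B LEFT JOIN C, the input bitmaps are {0, A, B}; after propagation C depends on both A and B.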

Statistics

The core object is ha_statistics. Its most important member is records, the row count of the table.

class ha_statistics {
  ulonglong data_file_length;     /* Length off data file */
  ulonglong max_data_file_length; /* Length off data file */
  ulonglong index_file_length;
  ulonglong max_index_file_length;
  ulonglong delete_length; /* Free bytes */
  ulonglong auto_increment_value;
  /*
    The number of records in the table.
      0    - means the table has exactly 0 rows
    other  - if (table_flags() & HA_STATS_RECORDS_IS_EXACT)
               the value is the exact number of records in the table
             else
               it is an estimate
  */
  ha_rows records;
  ha_rows deleted;       /* Deleted records */
  ulong mean_rec_length; /* physical reclength */
  /* TODO: create_time should be retrieved from the new DD. Remove this. */
  time_t create_time; /* When table was created */
  ulong check_time;
  ulong update_time;
  uint block_size; /* index block size */

  /*
    number of buffer bytes that native mrr implementation needs,
  */
  uint mrr_length_per_rec;
}

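MyRocks updates stats in handler::info. info is called on write paths other than insert and in some query scenarios to refresh the statistics (there are more than ten call sites).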

/**
    General method to gather info from handler

    ::info() is used to return information to the optimizer.
    SHOW also makes use of this data Another note, if your handler
    doesn't proved exact record count, you will probably want to
    have the following in your code:
    if (records < 2)
      records = 2;
    The reason is that the server will optimize for cases of only a single
    record. If in a table scan you don't know the number of records
    it will probably be better to set records to two so you can return
    as many records as you need.

    Along with records a few more variables you may wish to set are:
      records
      deleted
      data_file_length
      index_file_length
      delete_length
      check_time
    Take a look at the public variables in handler.h for more information.
    See also my_base.h for a full description.

    @param   flag          Specifies what info is requested
  */
  virtual int info(uint flag) = 0;

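// Possible bit values for flag are listed below. CONST is rarely used outside initialization. VARIABLE is used in most cases, because the variables it covers really are updated frequently. ERRKEY is not used on the normal path; it is for looking up information when reporting an error. AUTO is dedicated to the auto-increment value, which can be read from the in-memory table-level object.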

/*
  Recalculate loads of constant variables. MyISAM also sets things
  directly on the table share object.

  Check whether this should be fixed since handlers should not
  change things directly on the table object.

  Monty comment: This should NOT be changed!  It's the handlers
  responsibility to correct table->s->keys_xxxx information if keys
  have been disabled.

  The most important parameters set here is records per key on
  all indexes. block_size and primar key ref_length.

  For each index there is an array of rec_per_key.
  As an example if we have an index with three attributes a,b and c
  we will have an array of 3 rec_per_key.
  rec_per_key[0] is an estimate of number of records divided by
  number of unique values of the field a.
  rec_per_key[1] is an estimate of the number of records divided
  by the number of unique combinations of the fields a and b.
  rec_per_key[2] is an estimate of the number of records divided
  by the number of unique combinations of the fields a,b and c.

  Many handlers only set the value of rec_per_key when all fields
  are bound (rec_per_key[2] in the example above).

  If the handler doesn't support statistics, it should set all of the
  above to 0.

  update the 'constant' part of the info:
  handler::max_data_file_length, max_index_file_length, create_time
  sortkey, ref_length, block_size, data_file_name, index_file_name.
  handler::table->s->keys_in_use, keys_for_keyread, rec_per_key
*/
#define HA_STATUS_CONST 8
/*
  update the 'variable' part of the info:
  handler::records, deleted, data_file_length, index_file_length,
  check_time, mean_rec_length
*/
#define HA_STATUS_VARIABLE 16
/*
  This flag is used to get index number of the unique index that
  reported duplicate key.
  update handler::errkey and handler::dupp_ref
  see handler::get_dup_key()
*/
#define HA_STATUS_ERRKEY 32
/*
  update handler::auto_increment_value
*/
#define HA_STATUS_AUTO 64
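
To show how these bit flags combine, here is a small sketch of an info() implementation that refreshes only the groups the caller asked for. `ToyHandler`, its `stored_rows` member and the 16384 block size are hypothetical; only the flag values and the "report at least 2 rows" advice come from the comments quoted above.

```cpp
#include <cassert>

// Values copied from the HA_STATUS_* defines quoted above.
constexpr unsigned kStatusConst = 8;
constexpr unsigned kStatusVariable = 16;
constexpr unsigned kStatusAuto = 64;

// Hypothetical toy engine; it does not derive from the real handler class.
struct ToyHandler {
  unsigned long long records = 0;
  unsigned block_size = 0;
  unsigned long long auto_increment_value = 0;
  unsigned long long stored_rows = 0;  // pretend storage-side row estimate

  int info(unsigned flag) {
    if (flag & kStatusVariable) {
      // An engine without exact counts reports at least 2 rows, so the
      // server does not optimize for the single-record case.
      records = stored_rows < 2 ? 2 : stored_rows;
    }
    if (flag & kStatusConst) block_size = 16384;  // assumed page size
    if (flag & kStatusAuto) auto_increment_value = stored_rows + 1;
    return 0;
  }
};
```

A caller would pass, for example, `kStatusVariable | kStatusConst` to refresh both groups in one call while leaving the auto-increment value untouched.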

Join reorder

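The Optimize_table_order class performs the actual join reorder. Its entry point is its only public method, choose_table_order, which is called from make_join_plan. Optimize_table_order relies on three preconditions:

  • the dependencies among tables have been resolved
  • the possible access paths have been resolved
  • the statistics have been collected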

choose_table_order Steps:

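  1. Initialize the cost of the const_tables. If every table is a const table, return early.

  2. If this is the optimization of an sjm (semi-join materialization) plan, sort the tables once so that the semi-join tables come first (the subquery is computed in advance and can be materialized as needed).

  3. Otherwise, unless STRAIGHT_JOIN is set, tables with no dependency between them are sorted by row_count in ascending order.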

    if (SELECT_STRAIGHT_JOIN option is set)
      reorder tables so dependent tables come after tables they depend
      on, otherwise keep tables in the order they were specified in the query
    else
      Apply heuristic: pre-sort all access plans with respect to the number of
      records accessed.
    
    Sort algo is merge-sort (tbl >= 5) or insert-sort (tbl < 5)
  4. If there is a where_cond, walk the columns it references and set them in the table->cond_set bitmap.
  5. For STRAIGHT_JOIN, optimize the tables with optimize_straight_join. STRAIGHT_JOIN means the user has fixed the join order, so the work here is as its comment says: Select the best ways to access the tables in a query without reordering them.
  6. Otherwise, use the heuristic greedy algorithm greedy_search to reorder the joins.
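
The pre-sort in step 3 can be sketched as an insertion sort with a "dependencies first, then fewer rows" comparator. This is a simplified illustration, not MySQL's comparator: `Tab`, `before` and `presort` are hypothetical names. A hand-written insertion sort is used because a comparator of this shape is not guaranteed to be a strict weak ordering, which std::sort requires.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct Tab {                // hypothetical, far smaller than JOIN_TAB
  std::string name;
  unsigned idx;             // bit position of this table
  std::uint64_t dependent;  // bitmap of tables this one depends on
  std::uint64_t rows;       // estimated row count
};

// Returns true when a should be placed before b.
bool before(const Tab& a, const Tab& b) {
  if (a.dependent & (std::uint64_t{1} << b.idx)) return false;  // a needs b first
  if (b.dependent & (std::uint64_t{1} << a.idx)) return true;   // b needs a first
  return a.rows < b.rows;  // otherwise the smaller table goes first
}

// Insertion sort: each table moves left while it should precede its neighbor.
void presort(std::vector<Tab>& tabs) {
  for (std::size_t i = 1; i < tabs.size(); ++i) {
    for (std::size_t j = i; j > 0 && before(tabs[j], tabs[j - 1]); --j) {
      std::swap(tabs[j], tabs[j - 1]);
    }
  }
}
```

With B (10 rows, depends on A), A (1000 rows) and C (5 rows) as input, C moves to the front because it is small and independent, while B stays behind A despite being smaller than A.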

optimize_straight_join :

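  1. Only straight_join is supported: DBUG_ASSERT(join->select_lex->sj_nests.is_empty()); it is incompatible with semi-join and only looks at primary tables.

  2. For each JOIN_TAB, best_access_path computes its best access path. For an informal summary of the idea behind best_access_path, see the table access path item above and the EXPLAIN documentation on join types.

  3. set_prefix_join_cost computes the cost of the current table under the chosen access path and adds it to the running total. The cost is computed as follows: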

    m_row_evaluate_cost = 0.1 // default value
    
    /*
    Cost of accessing the table in course of the entire complete join
        execution, i.e. cost of one access method use (e.g. 'range' or
        'ref' scan ) multiplied by estimated number of rows from tables
        earlier in the join sequence.
    */
    read_cost = get_read_cost(table)
    
    void set_prefix_join_cost(uint idx, const Cost_model_server *cm) {
      if (idx == 0) {
        prefix_rowcount = rows_fetched;
        prefix_cost = read_cost + prefix_rowcount * m_row_evaluate_cost;
      } else {
        // this - 1 means last table
        prefix_rowcount = (this - 1)->prefix_rowcount * rows_fetched;
        prefix_cost = (this - 1)->prefix_cost + read_cost + prefix_rowcount * m_row_evaluate_cost;
      }
      // float filter_effect [0,1] means cond filters in executor may reduce rows. 1 means nothing filtered, 0 means all rows filtered and no rows left. It is used to calculate how many row combinations will be joined with the next table
      prefix_rowcount *= filter_effect;
    }
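
As a worked example of the formula above, the sketch below applies the same recurrence to a whole plan at once. `Pos` and `accumulate` are hypothetical names, read_cost is taken as given rather than derived from an access path, and 0.1 is the default row evaluation cost quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr double kRowEvaluateCost = 0.1;  // default value quoted above

struct Pos {              // hypothetical cut-down of a plan position
  double rows_fetched;    // rows read from this table per prefix row
  double read_cost;       // access cost for this table, taken as given
  double filter_effect;   // fraction of rows that survive the conditions
  double prefix_rowcount; // output: rows passed on to the next table
  double prefix_cost;     // output: accumulated cost up to this table
};

// Applies the prefix cost recurrence from left to right over the plan.
void accumulate(std::vector<Pos>& plan) {
  for (std::size_t i = 0; i < plan.size(); ++i) {
    Pos& p = plan[i];
    assert(p.filter_effect >= 0.0 && p.filter_effect <= 1.0);
    const double prev_rows = (i == 0) ? 1.0 : plan[i - 1].prefix_rowcount;
    const double prev_cost = (i == 0) ? 0.0 : plan[i - 1].prefix_cost;
    p.prefix_rowcount = prev_rows * p.rows_fetched;
    p.prefix_cost =
        prev_cost + p.read_cost + p.prefix_rowcount * kRowEvaluateCost;
    p.prefix_rowcount *= p.filter_effect;  // rows the next table will see
  }
}
```

For a first table with rows_fetched = 10, read_cost = 2 and no filtering, the prefix cost is 2 + 10 * 0.1 = 3. Adding a second table with rows_fetched = 5, read_cost = 20 and filter_effect = 0.5 gives 50 row combinations and a prefix cost of 3 + 20 + 50 * 0.1 = 28, with 25 rows passed on to the next table.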
    

greedy_search

bool Optimize_table_order::best_extension_by_limited_search( table_map remaining_tables, uint idx, uint current_search_depth);
  
procedure greedy_search
    input: remaining_tables
    output: partial_plan;
    {
      partial_plan = <>;
      do {
        (table, a) = best_extension_by_limited_search(partial_plan, remaining_tables, limit_search_depth);
        partial_plan = concat(partial_plan, (table, a));
        remaining_tables = remaining_tables - table;
      } while (remaining_tables != {})
      return partial_plan;
    }

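// Put simply, each step finds the table whose extension of the join path is best and appends it to the plan in order.
// This scheme is heavily influenced by the choice of the first table (the first table has no join relation, so the choice can only rely on its cardinality after filtering, and it is usually a small table), and a small first table is not necessarily optimal. A typical greedy optimizer evaluates the cost with each table in first position and picks the cheapest of the N results as the final plan. MySQL simply returns the first complete plan it finds.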
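
The greedy loop in the pseudocode can be sketched in C++ as follows. This is a simplified illustration with a search depth of 1 and a made-up cost model: `Cand`, `fanout`, `probe_cost` and `greedy_order` are hypothetical names, dependencies between tables are ignored, and inputs are assumed to be finite and non-negative.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct Cand {          // hypothetical per-table estimates
  double fanout;       // rows produced per row of the current prefix
  double probe_cost;   // access cost per row of the current prefix
};

// At each step, append the remaining table with the cheapest one-step extension.
std::vector<std::size_t> greedy_order(const std::vector<Cand>& t) {
  std::vector<std::size_t> plan;
  std::vector<bool> used(t.size(), false);
  double rows = 1.0;  // row count of the current partial plan
  while (plan.size() < t.size()) {
    std::size_t best = t.size();
    double best_cost = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < t.size(); ++i) {
      if (used[i]) continue;
      // Incremental cost: access cost plus 0.1 per produced row combination.
      const double cost = rows * t[i].probe_cost + rows * t[i].fanout * 0.1;
      if (cost < best_cost) {
        best_cost = cost;
        best = i;
      }
    }
    assert(best < t.size());  // holds for finite costs
    used[best] = true;
    plan.push_back(best);
    rows *= t[best].fanout;
  }
  return plan;
}
```

Because the first pick sees only each table's standalone cost, the sketch shares the weakness described in the comment above: a cheap first table fixes the prefix for every later choice.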

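best_extension_by_limited_search is a heuristic search, and search_depth is the maximum depth it may explore. The first half of best_extension_by_limited_search resembles optimize_straight_join:

  1. Compute best_access_path and its cost.

  2. If the cost so far already exceeds best_read, prune the branch; there is no need to search further.

  3. If prune_level=PRUNE_BY_TIME_OR_ROWS is enabled, check whether best_row_count and best_cost are both greater than the current rows and cost (note that newer versions combine the two with AND) and whether no later table depends on this table (think of it as the last node on this path of the graph, which is why it can be pruned directly, although it is not necessarily the last node of the whole plan). If so, set best_row_count and best_cost to the current values. The code is convoluted, but read together with the enclosing loop it roughly picks, at each step, the table that is best on the two dimensions (rowcount, cost), so the so-called pruning effectively becomes a strengthened greedy choice.

  4. Prefer eq_ref: on the first eq_ref, recursively collect all the eq_ref joins. (The original author's reasoning is that eq_ref is a 1:1 mapping, so its cost can be treated as roughly constant; the sequence of eq_ref joins is generated up front and can then be treated as one block placed at any position in later optimization. This of course requires that the eq_ref joins can be chained consecutively.)

  5. If remaining_tables is not empty, recurse until nothing remains.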
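
The cost-based pruning in step 2 can be illustrated with a small branch-and-bound search over join orders. This is a sketch under simplifying assumptions, not MySQL's implementation: `Cand` and `OrderSearch` are hypothetical, the cost model is made up, there is no depth limit, and dependencies, eq_ref chaining and the (rowcount, cost) heuristic of step 3 are left out. Pruning is safe here only because all costs are assumed to be non-negative.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct Cand {          // hypothetical per-table estimates
  double fanout;       // rows produced per row of the current prefix
  double probe_cost;   // access cost per row of the current prefix
};

class OrderSearch {
 public:
  explicit OrderSearch(const std::vector<Cand>& tables)
      : t_(tables), used_(tables.size(), false) {}

  // Returns the cost of the cheapest complete join order.
  double run() {
    extend(1.0, 0.0, 0);
    return best_read_;
  }

 private:
  void extend(double rows, double cost, std::size_t depth) {
    if (depth == t_.size()) {  // complete plan: remember it if cheaper
      if (cost < best_read_) best_read_ = cost;
      return;
    }
    for (std::size_t i = 0; i < t_.size(); ++i) {
      if (used_[i]) continue;
      const double c =
          cost + rows * t_[i].probe_cost + rows * t_[i].fanout * 0.1;
      if (c >= best_read_) continue;  // prune: cannot beat the best plan
      used_[i] = true;
      extend(rows * t_[i].fanout, c, depth + 1);
      used_[i] = false;
    }
  }

  const std::vector<Cand>& t_;
  std::vector<bool> used_;
  double best_read_ = std::numeric_limits<double>::infinity();
};
```

Once one complete plan has been found, every partial plan whose accumulated cost already reaches that value is abandoned without expanding the rest of its subtree.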
