MySQL -- JOIN
CREATE TABLE `t2` (
  `id` INT(11) NOT NULL,
  `a` INT(11) DEFAULT NULL,
  `b` INT(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `a` (`a`)
) ENGINE=InnoDB;

DROP PROCEDURE IF EXISTS idata;
DELIMITER ;;
CREATE PROCEDURE idata()
BEGIN
  DECLARE i INT;
  SET i=1;
  WHILE (i <= 1000) DO
    INSERT INTO t2 VALUES (i,i,i);
    SET i=i+1;
  END WHILE;
END;;
DELIMITER ;
CALL idata();

CREATE TABLE t1 LIKE t2;
INSERT INTO t1 (SELECT * FROM t2 WHERE id<=100);
Index Nested-Loop Join
-- With JOIN, the optimizer may choose either t1 or t2 as the driving table
-- With STRAIGHT_JOIN, the join order is fixed: t1 is the driving table and t2 is the driven table
SELECT * FROM t1 STRAIGHT_JOIN t2 ON (t1.a=t2.a);

mysql> EXPLAIN SELECT * FROM t1 STRAIGHT_JOIN t2 ON (t1.a=t2.a);
+----+-------------+-------+------------+------+---------------+------+---------+-----------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref       | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+-----------+------+----------+-------------+
|  1 | SIMPLE      | t1    | NULL       | ALL  | a             | NULL | NULL    | NULL      |  100 |   100.00 | Using where |
|  1 | SIMPLE      | t2    | NULL       | ref  | a             | a    | 5       | test.t1.a |    1 |   100.00 | NULL        |
+----+-------------+-------+------------+------+---------------+------+---------+-----------+------+----------+-------------+
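Execution process
- Read one row R from t1
- Take field a from R and look it up in t2
- Take the rows of t2 that match and combine each with R to form a row of the result set
- Repeat the steps above until all of t1 has been traversed
Rows scanned
- The driving table t1 gets a full table scan, which reads 100 rows
- For each row R, t2 is searched by field a, which is a tree search on the index
- The generated data matches one-to-one, so 100 rows are read from t2 in total
- The whole execution therefore scans 200 rows in total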
# Time: 2019-03-10T11:06:13.271095Z
# User@Host: root[root] @ localhost []  Id: 8
# Query_time: 0.001391  Lock_time: 0.000135  Rows_sent: 100  Rows_examined: 200
SET timestamp=1552215973;
SELECT * FROM t1 STRAIGHT_JOIN t2 ON (t1.a=t2.a);
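The 200-row total can be reproduced with a small simulation. The sketch below is an illustration only, not MySQL's implementation: Python lists stand in for the tables, a dictionary named `index_on_a` (a hypothetical name) stands in for the secondary index on `t2.a`, and rows are counted the same way the analysis above counts them.

```python
from collections import defaultdict

# Same shape as the article's data: t2 holds ids 1..1000, t1 holds ids 1..100, a = b = id.
t2 = [{"id": i, "a": i, "b": i} for i in range(1, 1001)]
t1 = [dict(row) for row in t2 if row["id"] <= 100]

# Assumption: a dict from value to row positions stands in for the B+ tree index on t2.a.
index_on_a = defaultdict(list)
for pos, row in enumerate(t2):
    index_on_a[row["a"]].append(pos)


def index_nested_loop_join(driving, driven, driven_index):
    """Join driving.a = driven.a through an index on driven.a.

    Returns (result_rows, rows_examined).
    """
    result, examined = [], 0
    for r in driving:                              # full scan of the driving table
        examined += 1
        for pos in driven_index.get(r["a"], []):   # index lookup instead of a scan
            examined += 1
            result.append((r, driven[pos]))
    return result, examined


rows, examined = index_nested_loop_join(t1, t2, index_on_a)
print(len(rows), examined)  # 100 200
```

The two numbers line up with `Rows_sent: 100` and `Rows_examined: 200` in the slow log.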
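Without a join
- Run `SELECT * FROM t1`, which scans 100 rows
- Loop over the 100 rows; for each row R, take `$R.a` and run `SELECT * FROM t2 WHERE a=$R.a`
- Compared with the join
  - The same 200 rows are scanned, but 101 statements are executed in total, and the client has to assemble the SQL statements and the results itself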
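To make the 101-statement count concrete, here is a minimal client-side sketch. It assumes Python's built-in `sqlite3` module as a stand-in for a MySQL connection (so the engine differs from the article's), and the index name `idx_a` and the variable names are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t2 (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
conn.execute("CREATE INDEX idx_a ON t2 (a)")
conn.executemany("INSERT INTO t2 VALUES (?, ?, ?)",
                 [(i, i, i) for i in range(1, 1001)])
conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
conn.execute("INSERT INTO t1 SELECT * FROM t2 WHERE id <= 100")

statements = 0
result = []

# Statement 1: read the whole driving table.
outer_rows = conn.execute("SELECT * FROM t1").fetchall()
statements += 1

# Statements 2..101: one lookup per driving row, stitched together by the client.
for r in outer_rows:
    matches = conn.execute("SELECT * FROM t2 WHERE a = ?", (r[1],)).fetchall()
    statements += 1
    for m in matches:
        result.append(r + m)   # concatenate the two row tuples

print(statements, len(result))  # 101 100
conn.close()
```

The result set is the same as the join's, but the client pays for 101 round trips and does the stitching itself.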
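Choosing the driving table
- In the query above, the driving table gets a full table scan and the driven table gets a tree search
- Suppose the driven table has M rows
  - Each lookup of one row in the driven table first searches secondary index a and then the primary key index
  - So the time complexity of looking up one row in the driven table is about $2*\log_2 M$
- Suppose the driving table has N rows; all N rows must be scanned
- The time complexity of the whole execution is $N + N*2*\log_2 M$
- N has the larger effect on the number of rows scanned, so choose the small table as the driving table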
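Plugging the article's table sizes into the formula shows how lopsided the two join orders are. This is a rough sketch: `inlj_cost` is a hypothetical helper that evaluates $N + N*2*\log_2 M$ literally, so the result is an abstract cost, not a row count or a timing.

```python
import math


def inlj_cost(n_driving, m_driven):
    """Evaluate N + N * 2 * log2(M) for an Index Nested-Loop Join."""
    return n_driving + n_driving * 2 * math.log2(m_driven)


small_drives = inlj_cost(100, 1000)   # t1 (100 rows) drives, t2 (1000 rows) is driven
large_drives = inlj_cost(1000, 100)   # the roles are swapped

print(round(small_drives))  # 2093
print(round(large_drives))  # 14288
```

With the small table driving, the estimated cost is roughly one seventh of the alternative, because N multiplies the whole lookup term while M only appears inside the logarithm.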
Simple Nested-Loop Join
SELECT * FROM t1 STRAIGHT_JOIN t2 ON (t1.a=t2.b);
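- Field b of the driven table t2 has no index, so every match against t2 requires a full table scan
- With the algorithm above, the time complexity is $N + N*M$, and the total number of rows scanned is 100,100 (about 100 thousand)
- If t1 and t2 each had 100 thousand rows, the total would be 10,000,100,000 (about 10 billion)
- For this reason MySQL itself does not use the Simple Nested-Loop Join algorithm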
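For comparison, a Simple Nested-Loop Join over the article's data can be sketched as below. It is an illustration of the $N + N*M$ count with Python lists standing in for tables; it says nothing about how a real storage engine reads pages.

```python
def simple_nested_loop_join(driving, driven):
    """Join driving.a = driven.b with no index: rescan `driven` for every driving row."""
    result, examined = [], 0
    for r in driving:
        examined += 1            # one row read from the driving table
        for s in driven:         # full scan of the driven table, every time
            examined += 1
            if r["a"] == s["b"]:
                result.append((r, s))
    return result, examined


t2 = [{"id": i, "a": i, "b": i} for i in range(1, 1001)]
t1 = [dict(row) for row in t2[:100]]

rows, examined = simple_nested_loop_join(t1, t2)
print(len(rows), examined)  # 100 100100
```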
Block Nested-Loop Join
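Target scenario: the driven table has no usable index
join_buffer is large enough
Execution process
- Read t1's data into the per-thread memory area `join_buffer`; because the statement is `SELECT *`, the whole of t1 is put into `join_buffer`
- Scan t2, take each row of t2, and compare it with the data in `join_buffer`
  - Rows that satisfy the join condition are returned as part of the result set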
-- The default is 256 KB
-- 4194304 bytes == 4 MB
mysql> SHOW VARIABLES LIKE '%join_buffer_size%';
+------------------+---------+
| Variable_name    | Value   |
+------------------+---------+
| join_buffer_size | 4194304 |
+------------------+---------+
EXPLAIN
mysql> EXPLAIN SELECT * FROM t1 STRAIGHT_JOIN t2 ON (t1.a=t2.b);
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra                                              |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+----------------------------------------------------+
|  1 | SIMPLE      | t1    | NULL       | ALL  | a             | NULL | NULL    | NULL |  100 |   100.00 | NULL                                               |
|  1 | SIMPLE      | t2    | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 1000 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+----------------------------------------------------+

# Time: 2019-03-10T12:19:57.245356Z
# User@Host: root[root] @ localhost []  Id: 8
# Query_time: 0.010132  Lock_time: 0.000192  Rows_sent: 100  Rows_examined: 1100
SET timestamp=1552220397;
SELECT * FROM t1 STRAIGHT_JOIN t2 ON (t1.a=t2.b);
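- Over the whole process t1 and t2 each get one full table scan, so the total number of rows scanned is 1100
- Because `join_buffer` is organized as an unordered array, every row of t2 requires 100 comparisons
  - So the total number of in-memory comparisons is 100,000
- Simple Nested-Loop Join also scans about 100,000 rows, so the time complexity is the same
  - But the 100,000 comparisons in Block Nested-Loop Join are in-memory operations and are much faster
  - Simple Nested-Loop Join may involve disk operations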
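The split between rows scanned and in-memory comparisons can be sketched as follows. This is a simplified model, assuming the whole driving table fits in the buffer; a plain Python list stands in for `join_buffer`, and the counters are illustrative names rather than MySQL status variables.

```python
def block_nested_loop_join(driving, driven):
    """BNL when the whole driving table fits in the join buffer.

    Returns (result_rows, rows_scanned, in_memory_comparisons).
    """
    join_buffer = list(driving)      # one full scan of the driving table
    scanned = len(join_buffer)
    comparisons = 0
    result = []
    for s in driven:                 # one full scan of the driven table
        scanned += 1
        for r in join_buffer:        # unordered array: every buffered row is checked
            comparisons += 1
            if r["a"] == s["b"]:
                result.append((r, s))
    return result, scanned, comparisons


t2 = [{"id": i, "a": i, "b": i} for i in range(1, 1001)]
t1 = [dict(row) for row in t2[:100]]

rows, scanned, comparisons = block_nested_loop_join(t1, t2)
print(len(rows), scanned, comparisons)  # 100 1100 100000
```

The amount of comparison work matches Simple Nested-Loop Join, but each table is read only once.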
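Choosing the driving table
- Suppose the small table has N rows and the large table has M rows
- Both tables get one full table scan, so the total number of rows scanned is $M+N$
- The number of in-memory comparisons is $M*N$
- In this case it makes no difference whether the large table or the small table is the driving table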
join_buffer is too small
-- t1 no longer fits in one go, so it is loaded in segments
SET join_buffer_size=1200;

# Time: 2019-03-10T12:30:32.194726Z
# User@Host: root[root] @ localhost []  Id: 8
# Query_time: 0.009459  Lock_time: 0.000559  Rows_sent: 100  Rows_examined: 2100
SET timestamp=1552221032;
SELECT * FROM t1 STRAIGHT_JOIN t2 ON (t1.a=t2.b);
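Execution process
1. Scan t1, reading rows sequentially into `join_buffer`; after row 88 `join_buffer` is full, so go on to step 2
2. Scan t2, take each row of t2, and compare it with the data in `join_buffer`
   - Rows that satisfy the join condition are returned as part of the result set
3. Clear `join_buffer` (so it can be reused, which is the core idea behind "Block")
4. Continue scanning t1, load the last 12 rows into `join_buffer`, and run step 2 again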
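Performance
- Because t1 is loaded into `join_buffer` in two parts, t2 is scanned twice, so the total number of rows scanned is 2100
- The number of in-memory comparisons does not change; it is still 100,000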
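The segmented variant can be sketched by giving the buffer a capacity. One loud assumption: the real buffer is sized in bytes by `join_buffer_size`, whereas the hypothetical `buffer_rows` parameter below is a row count, set to 88 only because the article observes that 88 rows of t1 fit.

```python
def segmented_bnl_join(driving, driven, buffer_rows):
    """Block Nested-Loop Join with a join buffer that holds `buffer_rows` rows.

    Returns (result_rows, rows_scanned, in_memory_comparisons, driven_table_passes).
    """
    result = []
    scanned = comparisons = passes = 0
    for start in range(0, len(driving), buffer_rows):
        # Refill the buffer with the next segment (the old contents are discarded).
        join_buffer = driving[start:start + buffer_rows]
        scanned += len(join_buffer)
        passes += 1
        for s in driven:             # one full scan of the driven table per segment
            scanned += 1
            for r in join_buffer:
                comparisons += 1
                if r["a"] == s["b"]:
                    result.append((r, s))
    return result, scanned, comparisons, passes


t2 = [{"id": i, "a": i, "b": i} for i in range(1, 1001)]
t1 = [dict(row) for row in t2[:100]]

rows, scanned, comparisons, passes = segmented_bnl_join(t1, t2, buffer_rows=88)
print(len(rows), scanned, comparisons, passes)  # 100 2100 100000 2
```

The segments of 88 and 12 rows cause two passes over t2, which accounts for `Rows_examined: 2100`, while the comparison count stays at 100,000.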
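Choosing the driving table
- Suppose the driving table has N rows and needs K segments to complete the algorithm, and the driven table has M rows
  - K grows with N, so write $K = \lambda*N$ with $0 < \lambda \le 1$; for a fixed N, $\lambda$ is determined by `join_buffer_size`
- The number of rows scanned is $N + \lambda*N*M$
  - Reducing N lowers the scanned-row count more than reducing M does
  - So choose the small table as the driving table
- The number of in-memory comparisons is $N*M$ (the same for either order, so it need not be considered)
- To reduce $\lambda$, increase `join_buffer_size`: the more rows fit at once, the fewer segments are needed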
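The formula can be checked with a few numbers. The sketch below assumes the buffer capacity can be expressed as a row count (`buffer_rows`, a hypothetical parameter; MySQL itself sizes the buffer in bytes) and computes the segment count as K = ceil(N / buffer_rows), which is where the $\lambda*N$ term comes from.

```python
import math


def bnl_scanned_rows(n_driving, m_driven, buffer_rows):
    """Rows scanned by Block Nested-Loop Join: N + K * M, with K = ceil(N / buffer_rows)."""
    segments = math.ceil(n_driving / buffer_rows)
    return n_driving + segments * m_driven


print(bnl_scanned_rows(100, 1000, 88))    # 2100: small table drives, 2 segments
print(bnl_scanned_rows(1000, 100, 88))    # 2200: large table drives, 12 segments
print(bnl_scanned_rows(100, 1000, 100))   # 1100: a larger buffer needs only 1 segment
```

With the same buffer, letting the small table drive scans fewer rows, and enlarging the buffer reduces the count further.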
Small table
-- Restore the default of 256 KB
SET join_buffer_size=262144;
Filtered row count
t1 as the driving table
mysql> EXPLAIN SELECT * FROM t1 STRAIGHT_JOIN t2 ON (t1.b=t2.b) WHERE t2.id<=50;
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra                                              |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------------------------+
|  1 | SIMPLE      | t1    | NULL       | ALL   | NULL          | NULL    | NULL    | NULL |  100 |   100.00 | NULL                                               |
|  1 | SIMPLE      | t2    | NULL       | range | PRIMARY       | PRIMARY | 4       | NULL |   50 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------------------------+

# Time: 2019-03-10T13:15:50.346563Z
# User@Host: root[root] @ localhost []  Id: 8
# Query_time: 0.001006  Lock_time: 0.000162  Rows_sent: 50  Rows_examined: 150
SET timestamp=1552223750;
SELECT * FROM t1 STRAIGHT_JOIN t2 ON (t1.b=t2.b) WHERE t2.id<=50;
t2 as the driving table
`join_buffer` only needs to hold the first 50 rows of t2, so the first 50 rows of t2 are a smaller table than all the rows of t1
mysql> EXPLAIN SELECT * FROM t2 STRAIGHT_JOIN t1 ON (t1.b=t2.b) WHERE t2.id<=50;
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra                                              |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------------------------+
|  1 | SIMPLE      | t2    | NULL       | range | PRIMARY       | PRIMARY | 4       | NULL |   50 |   100.00 | Using where                                        |
|  1 | SIMPLE      | t1    | NULL       | ALL   | NULL          | NULL    | NULL    | NULL |  100 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------------------------+

# Time: 2019-03-10T13:18:26.656339Z
# User@Host: root[root] @ localhost []  Id: 8
# Query_time: 0.000965  Lock_time: 0.000150  Rows_sent: 50  Rows_examined: 150
SET timestamp=1552223906;
SELECT * FROM t2 STRAIGHT_JOIN t1 ON (t1.b=t2.b) WHERE t2.id<=50;
Optimizer's choice
-- The optimizer chooses t2 as the driving table
mysql> EXPLAIN SELECT * FROM t1 JOIN t2 ON (t1.b=t2.b) WHERE t2.id<=50;
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra                                              |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------------------------+
|  1 | SIMPLE      | t2    | NULL       | range | PRIMARY       | PRIMARY | 4       | NULL |   50 |   100.00 | Using where                                        |
|  1 | SIMPLE      | t1    | NULL       | ALL   | NULL          | NULL    | NULL    | NULL |  100 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------------------------+
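Number of columns
t1 as the driving table
Only field b of t1 is queried, so if t1 is put into `join_buffer`, only the values of field b need to be stored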
mysql> EXPLAIN SELECT t1.b,t2.* FROM t1 STRAIGHT_JOIN t2 ON (t1.b=t2.b) WHERE t2.id<=100;
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra                                              |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------------------------+
|  1 | SIMPLE      | t1    | NULL       | ALL   | NULL          | NULL    | NULL    | NULL |  100 |   100.00 | NULL                                               |
|  1 | SIMPLE      | t2    | NULL       | range | PRIMARY       | PRIMARY | 4       | NULL |  100 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------------------------+

# Time: 2019-03-10T13:23:55.558748Z
# User@Host: root[root] @ localhost []  Id: 8
# Query_time: 0.002742  Lock_time: 0.000123  Rows_sent: 100  Rows_examined: 200
SET timestamp=1552224235;
SELECT t1.b,t2.* FROM t1 STRAIGHT_JOIN t2 ON (t1.b=t2.b) WHERE t2.id<=100;
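t2 as the driving table
All fields of t2 are queried, so if t2 is put into `join_buffer`, the three fields `id`, `a`, and `b` must all be stored; t1 is therefore the smaller table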
mysql> EXPLAIN SELECT t1.b,t2.* FROM t2 STRAIGHT_JOIN t1 on (t1.b=t2.b) WHERE t2.id<=100;
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra                                              |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------------------------+
|  1 | SIMPLE      | t2    | NULL       | range | PRIMARY       | PRIMARY | 4       | NULL |  100 |   100.00 | Using where                                        |
|  1 | SIMPLE      | t1    | NULL       | ALL   | NULL          | NULL    | NULL    | NULL |  100 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------------------------+

# Time: 2019-03-10T13:24:51.561116Z
# User@Host: root[root] @ localhost []  Id: 8
# Query_time: 0.002680  Lock_time: 0.000907  Rows_sent: 100  Rows_examined: 200
SET timestamp=1552224291;
SELECT t1.b,t2.* FROM t2 STRAIGHT_JOIN t1 on (t1.b=t2.b) WHERE t2.id<=100;
Optimizer's choice
-- But the optimizer still chooses t2 as the driving table
mysql> EXPLAIN SELECT t1.b,t2.* FROM t2 JOIN t1 on (t1.b=t2.b) WHERE t2.id<=100;
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra                                              |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------------------------+
|  1 | SIMPLE      | t2    | NULL       | range | PRIMARY       | PRIMARY | 4       | NULL |  100 |   100.00 | Using where                                        |
|  1 | SIMPLE      | t1    | NULL       | ALL   | NULL          | NULL    | NULL    | NULL |  100 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------------------------+
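Summary
When choosing the driving table, first filter each table by its own conditions, then compute the total data volume of the fields that take part in the join; the table with the smaller data volume is the small table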
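The rule can be turned into a back-of-the-envelope comparison. The sketch below is a deliberate simplification: it measures "data volume" as rows after filtering times the number of columns that would go into `join_buffer`, ignoring column widths and any per-row overhead. `join_buffer_cells` and `pick_small_table` are hypothetical helpers, not part of MySQL.

```python
def join_buffer_cells(rows_after_filter, columns_needed):
    """Rough data volume: filtered row count times number of buffered columns."""
    return rows_after_filter * len(columns_needed)


def pick_small_table(candidates):
    """candidates maps table name -> (rows_after_filter, columns_needed)."""
    return min(candidates, key=lambda name: join_buffer_cells(*candidates[name]))


# SELECT * FROM t1 JOIN t2 ON (t1.b=t2.b) WHERE t2.id<=50
by_rows = pick_small_table({
    "t1": (100, ["id", "a", "b"]),
    "t2": (50, ["id", "a", "b"]),
})

# SELECT t1.b,t2.* FROM t1 JOIN t2 ON (t1.b=t2.b) WHERE t2.id<=100
by_columns = pick_small_table({
    "t1": (100, ["b"]),
    "t2": (100, ["id", "a", "b"]),
})

print(by_rows, by_columns)  # t2 t1
```

In the first query the filter makes t2 the small table; in the second the row counts are equal and the narrower column list makes t1 the small table, which is the reasoning of the two sections above.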
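FAQ
- Can joins be used?
  - If Index Nested-Loop Join is used, that is, the index on the driven table is used, there is little problem
  - If Block Nested-Loop Join is used, the number of rows scanned may be excessive, so avoid it where possible; confirm with `EXPLAIN`
- Should the small table or the large table be the driving table?
  - With Index Nested-Loop Join, choose the small table as the driving table
  - With Block Nested-Loop Join
    - When `join_buffer` is large enough, it makes no difference
    - When `join_buffer` is too small (the more common case), choose the small table as the driving table
  - Conclusion: choose the small table as the driving table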
Please credit the source when reposting: http://zhongmingmao.me/2019/03/10/mysql-join/