
Translation: Designing and Tuning Indexes

Excerpted from: Pro SQL Server Internals, 2nd edition

Author: Dmitri Korotkevitch

Chapter 7

Designing and Tuning Indexes

It is impossible to define an indexing strategy that will work everywhere. Every system is unique and requires its own indexing approach based on workload, business requirements, and quite a few other factors. However, there are several design considerations and guidelines that can be applied in every system.

The same is true when we are optimizing existing systems. While optimization is an iterative process that is unique in every case, there is a set of techniques that can be used to detect inefficiencies in every database system.

In this chapter, we will cover a few important factors that you will need to keep in mind when designing new indexes and optimizing existing systems.

Clustered Index Design Considerations

Every time you change the value of a clustered index key, two things happen. First, SQL Server moves the row to a different place in the clustered index page chain and in the data files. Second, it updates the row-id, which is the clustered index key. The row-id is stored in all nonclustered indexes and must be updated in each of them. That can be expensive in terms of I/O, especially in the case of batch updates. Moreover, it can increase the fragmentation of the clustered index and, when the row-id grows in size, of the nonclustered indexes. Thus, it is better to have a static clustered index where key values do not change.

All nonclustered indexes use the clustered index key as the row-id. A clustered index key that is too wide increases the size of nonclustered index rows and requires more space to store them. As a result, SQL Server needs to process more data pages during index- or range-scan operations, which makes the index less efficient.

In cases of non-unique nonclustered indexes, the row-id is also stored at non-leaf index levels, which, in turn, reduces the number of index records per page and can lead to extra intermediate levels in the index. Even though non-leaf index levels are usually cached in memory, this introduces additional logical reads every time SQL Server traverses the nonclustered index B-Tree.

Finally, larger nonclustered indexes use more space in the buffer pool and introduce more overhead during index maintenance. Obviously, it is impossible to provide a generic threshold value that defines the maximum acceptable size of a key that can be applied to any table. However, as a general rule, it is better to keep the clustered index key as narrow as possible.
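As a starting point for reviewing key width, the following sketch lists the declared size of every clustered index key in the current database. It relies only on the standard catalog views; the column aliases are illustrative, and max_length reports the declared maximum size in bytes rather than the actual size of the stored data.

```sql
-- Sketch: declared width of each clustered index key in the current database.
-- max_length is the declared maximum size in bytes (-1 for (max) types).
select
    schema_name(t.schema_id) as [schema name]
    ,t.name as [table name]
    ,i.name as [clustered index]
    ,count(*) as [key columns]
    ,sum(c.max_length) as [declared key bytes]
from
    sys.tables t
        join sys.indexes i on
            i.object_id = t.object_id and i.type = 1 -- 1 = clustered rowstore index
        join sys.index_columns ic on
            ic.object_id = i.object_id and ic.index_id = i.index_id
        join sys.columns c on
            c.object_id = ic.object_id and c.column_id = ic.column_id
where
    ic.key_ordinal > 0 -- key columns only
group by
    t.schema_id, t.name, i.name
order by
    [declared key bytes] desc;
```

Tables at the top of the result are the first candidates for a closer look, because every nonclustered index on them carries that key as the row-id.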

It is also beneficial to define the clustered index as unique. The reason this is important is not obvious. Consider a scenario in which a table does not have a unique clustered index and you want to run a query that uses a nonclustered index seek in the execution plan. In this case, if the row-id in the nonclustered index were not unique, SQL Server would not know which clustered index row to choose during the key lookup operation.

SQL Server solves such problems by adding another nullable integer column, called a uniquifier, to nonunique clustered indexes. SQL Server populates the uniquifier with NULL for the first occurrence of the key value and autoincrements it for each subsequent duplicate inserted into the table.

Note: The number of possible duplicates per clustered index key value is limited by integer domain values. You cannot have more than 2,147,483,648 rows with the same clustered index key. This is a theoretical limit, and it is clearly a bad idea to create indexes with such poor selectivity.
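To see which tables are subject to this behavior, you can list the clustered indexes that are not declared unique. This is a minimal sketch against the standard catalog views; the column aliases are illustrative.

```sql
-- Sketch: clustered indexes that are not unique and therefore carry a uniquifier.
select
    schema_name(t.schema_id) as [schema name]
    ,t.name as [table name]
    ,i.name as [index name]
from
    sys.tables t
        join sys.indexes i on
            i.object_id = t.object_id
where
    i.type = 1          -- clustered rowstore index
    and i.is_unique = 0 -- not declared as unique
order by
    [schema name], [table name];
```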

Let’s look at the overhead introduced by uniquifiers in non-unique clustered indexes. The code shown in Listing 7-1 creates three different tables of the same structure and populates them with 65,536 rows each. Table dbo.UniqueCI is the only table with a unique clustered index defined. Table dbo.NonUniqueCINoDups does not have any duplicated key values. Finally, table dbo.NonUniqueCIDups has a large number of duplicates in the index.

Listing 7-1. Nonunique clustered index: Table creation

create table dbo.UniqueCI
(
    KeyValue int not null,
    ID int not null,
    Data char(986) null,
    VarData varchar(32) not null
        constraint DEF_UniqueCI_VarData
        default 'Data'
);

create unique clustered index IDX_UniqueCI_KeyValue
on dbo.UniqueCI(KeyValue);

create table dbo.NonUniqueCINoDups
(
    KeyValue int not null,
    ID int not null,
    Data char(986) null,
    VarData varchar(32) not null
        constraint DEF_NonUniqueCINoDups_VarData
        default 'Data'
);

create /*unique*/ clustered index IDX_NonUniqueCINoDups_KeyValue
on dbo.NonUniqueCINoDups(KeyValue);

create table dbo.NonUniqueCIDups
(
    KeyValue int not null,
    ID int not null,
    Data char(986) null,
    VarData varchar(32) not null
        constraint DEF_NonUniqueCIDups_VarData
        default 'Data'
);

create /*unique*/ clustered index IDX_NonUniqueCIDups_KeyValue
on dbo.NonUniqueCIDups(KeyValue);

-- Populating data
;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows
,N5(C) as (select 0 from N4 as T1 cross join N4 as T2) -- 65,536 rows
,IDs(ID) as (select row_number() over (order by (select null)) from N5)
insert into dbo.UniqueCI(KeyValue, ID)
    select ID, ID from IDs;

-- Same key values as dbo.UniqueCI: no duplicates
insert into dbo.NonUniqueCINoDups(KeyValue, ID)
    select KeyValue, ID from dbo.UniqueCI;

-- Only ten distinct key values: a large number of duplicates
insert into dbo.NonUniqueCIDups(KeyValue, ID)
    select KeyValue % 10, ID from dbo.UniqueCI;

 

Now, let’s look at the clustered indexes’ physical statistics for each table. The code for this is shown in Listing 7-2, and the results are shown in Figure 7-1.

Listing 7-2. Nonunique clustered index: Checking clustered indexes’ row sizes

select index_level, page_count, min_record_size_in_bytes as [min row size]
    ,max_record_size_in_bytes as [max row size]
    ,avg_record_size_in_bytes as [avg row size]
from
    sys.dm_db_index_physical_stats(db_id(), object_id(N'dbo.UniqueCI'), 1, null, 'DETAILED');

select index_level, page_count, min_record_size_in_bytes as [min row size]
    ,max_record_size_in_bytes as [max row size]
    ,avg_record_size_in_bytes as [avg row size]
from
    sys.dm_db_index_physical_stats(db_id(), object_id(N'dbo.NonUniqueCINoDups'), 1, null, 'DETAILED');

select index_level, page_count, min_record_size_in_bytes as [min row size]
    ,max_record_size_in_bytes as [max row size]
    ,avg_record_size_in_bytes as [avg row size]
from
    sys.dm_db_index_physical_stats(db_id(), object_id(N'dbo.NonUniqueCIDups'), 1, null, 'DETAILED');

 

 

Figure 7-1. Nonunique clustered index: Clustered indexes’ row size

Even though there are no duplicated key values in the dbo.NonUniqueCINoDups table, there are still two extra bytes added to the row. SQL Server stores a uniquifier in the variable-length section of the data, and those two bytes are added by yet another entry in a variable-length data offset array.

When a clustered index has duplicate values, uniquifiers add yet another four bytes, which makes for an overhead of six bytes total.

It is worth mentioning that in some edge cases, the extra storage space used by the uniquifier can reduce the number of rows that can fit onto the data page. Our example demonstrates such a condition. As you can see, dbo.UniqueCI uses about 15 percent fewer data pages than the other two tables.
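The edge case comes down to integer arithmetic on the page capacity. The sketch below assumes 8,096 bytes per page available for rows and the slot array (an 8,192-byte page minus the 96-byte header), a 2-byte slot array entry per row, and a 1,009-byte row in dbo.UniqueCI. The row size is an assumption derived from the standard row layout for this table definition; the actual values are the ones reported by Listing 7-2.

```sql
-- Sketch under stated assumptions: how a few extra bytes change rows per page.
declare
    @PageBytes int = 8096 -- 8,192-byte page minus the 96-byte page header
    ,@SlotBytes int = 2   -- slot array entry per row
    ,@RowBytes int = 1009 -- assumed dbo.UniqueCI row size; verify with Listing 7-2
    ,@Rows int = 65536;

select
    v.Overhead as [uniquifier overhead]
    -- integer division: only whole rows fit on a page
    ,@PageBytes / (@RowBytes + v.Overhead + @SlotBytes) as [rows per page]
    ,ceiling(1.0 * @Rows / (@PageBytes / (@RowBytes + v.Overhead + @SlotBytes))) as [leaf pages]
from
    (values (0), (2), (6)) v(Overhead); -- no uniquifier, NULL uniquifier, populated uniquifier
```

Under these assumptions, eight rows fit on a page without the uniquifier and only seven fit with it, which is why two extra bytes per row are enough to add a noticeable number of leaf pages.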

Now, let’s see how the uniquifier affects nonclustered indexes. The code shown in Listing 7-3 creates nonclustered indexes in all three tables. Figure 7-2 shows the physical statistics for those indexes.

Listing 7-3. Nonunique clustered index: Checking nonclustered indexes’ row size

create nonclustered index IDX_UniqueCI_ID
on dbo.UniqueCI(ID);

create nonclustered index IDX_NonUniqueCINoDups_ID
on dbo.NonUniqueCINoDups(ID);

create nonclustered index IDX_NonUniqueCIDups_ID
on dbo.NonUniqueCIDups(ID);

select index_level, page_count, min_record_size_in_bytes as [min row size]
    ,max_record_size_in_bytes as [max row size]
    ,avg_record_size_in_bytes as [avg row size]
from
    sys.dm_db_index_physical_stats(db_id(), object_id(N'dbo.UniqueCI'), 2, null, 'DETAILED');

select index_level, page_count, min_record_size_in_bytes as [min row size]
    ,max_record_size_in_bytes as [max row size]
    ,avg_record_size_in_bytes as [avg row size]
from
    sys.dm_db_index_physical_stats(db_id(), object_id(N'dbo.NonUniqueCINoDups'), 2, null, 'DETAILED');

select index_level, page_count, min_record_size_in_bytes as [min row size]
    ,max_record_size_in_bytes as [max row size]
    ,avg_record_size_in_bytes as [avg row size]
from
    sys.dm_db_index_physical_stats(db_id(), object_id(N'dbo.NonUniqueCIDups'), 2, null, 'DETAILED');

 

 

Figure 7-2. Nonunique clustered index: Nonclustered indexes’ row size

There is no overhead in the nonclustered index in the dbo.NonUniqueCINoDups table. As you will recall, SQL Server does not store offset information in a variable-length offset array for trailing columns storing NULL data. Nonetheless, the uniquifier introduces eight bytes of overhead in the dbo.NonUniqueCIDups table. Those eight bytes consist of a four-byte uniquifier value, a two-byte variable-length data offset array entry, and a two-byte entry storing the number of variable-length columns in the row.

We can summarize the storage overhead of the uniquifier in the following way. For the rows that have a uniquifier as NULL, there is a two-byte overhead if the index has at least one variable-length column that stores a NOT NULL value. That overhead comes from the variable-length offset array entry for the uniquifier column. There is no overhead otherwise.

In cases where the uniquifier is populated, the overhead is six bytes if there are variable-length columns that store NOT NULL values. Otherwise, the overhead is eight bytes.

Tip: If you expect a large number of duplicates in the clustered index values, you can add an integer identity column as the rightmost column to the index, thereby making it unique. This adds a predictable four-byte storage overhead to every row, as compared to the unpredictable overhead of up to eight bytes introduced by uniquifiers. This can also improve the performance of individual lookup operations when you reference the row by all of its clustered index columns.
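A minimal sketch of this tip is shown below. The table dbo.OrderEvents and its columns are hypothetical and exist only for illustration; the point is the identity column placed last in the clustered index key.

```sql
-- Hypothetical table: OrderDate alone would contain many duplicates.
create table dbo.OrderEvents
(
    OrderDate date not null,
    EventId int not null identity(1,1),
    Payload varchar(100) null
);

-- The identity column makes every (OrderDate, EventId) pair unique,
-- so the index can be declared unique and SQL Server does not add a uniquifier.
create unique clustered index IDX_OrderEvents_OrderDate_EventId
on dbo.OrderEvents(OrderDate, EventId);
```

Queries that filter on OrderDate alone can still use the index, because OrderDate remains the leftmost key column.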

It is beneficial to design clustered indexes in a way that minimizes index fragmentation caused by inserting new rows. One of the methods to accomplish this is by making clustered index values ever increasing. The index on the identity column is one such example. Another example is a datetime column populated with the current system time at the moment of insertion.

There are two potential issues with ever-increasing indexes, however. The first relates to statistics. As you learned in Chapter 3, the legacy cardinality estimator in SQL Server underestimates cardinality when parameter values are not present in the histogram. You should factor such behavior into your statistics maintenance strategy for the system, unless you are using the new SQL Server 2014-2016 cardinality estimators, which assume that data outside of the histogram has a distribution similar to that of other data in the table.
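The commands below sketch one way to check whether this matters for a given index, using the dbo.UniqueCI table from Listing 7-1 as a stand-in for an ever-increasing index. Whether a full scan is appropriate for your maintenance window is a judgment call; treat this as an illustration rather than a recommended schedule.

```sql
-- Inspect the histogram: the last RANGE_HI_KEY row shows the highest key value
-- that the statistics currently know about.
dbcc show_statistics('dbo.UniqueCI', IDX_UniqueCI_KeyValue) with histogram;

-- Refresh the statistics so that recently inserted key values enter the histogram.
update statistics dbo.UniqueCI IDX_UniqueCI_KeyValue with fullscan;

-- Databases at compatibility level 120 or higher use the new cardinality estimator
-- by default, unless it has been overridden at the database or query level.
select name, compatibility_level
from sys.databases
where database_id = db_id();
```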

The next problem is more complicated. With ever-increasing indexes, the data is always inserted at the end of the index. On the one hand, it prevents page splits and reduces fragmentation. On the other hand, it can lead to hot spots, which are serialization delays that occur when multiple sessions are trying to modify the same data page and/or allocate new pages or extents. SQL Server does not allow multiple sessions to update the same data structures simultaneously, and instead serializes those operations.

Hot spots are usually not an issue unless a system collects data at a very high rate and the index handles hundreds of inserts per second. We will discuss how to detect such an issue in Chapter 27, “System Troubleshooting.”

Finally, if a system has a set of frequently executed and important queries, it might be beneficial to consider a clustered index that optimizes them. This eliminates expensive key lookup operations and improves the performance of the system.

Even though such queries can be optimized by using covering nonclustered indexes, it is not always the ideal solution. In some cases, it requires you to create very wide nonclustered indexes, which will use up a lot of storage space both on disk and in the buffer pool.
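To make the trade-off concrete, here is a small sketch of the covering-index alternative. The dbo.Orders table, its columns, and the query are hypothetical; they are not part of the original example set.

```sql
-- Hypothetical table used only for illustration.
create table dbo.Orders
(
    OrderId int not null identity(1,1),
    CustomerId int not null,
    OrderDate datetime2(0) not null,
    Amount money not null,
    Notes varchar(500) null
);

create unique clustered index IDX_Orders_OrderId
on dbo.Orders(OrderId);

-- Covering nonclustered index: the key supports the seek, and the included
-- column lets the query below avoid key lookups into the clustered index.
create nonclustered index IDX_Orders_CustomerId_OrderDate
on dbo.Orders(CustomerId, OrderDate)
include(Amount);

declare @CustomerId int = 42;

select OrderDate, Amount
from dbo.Orders
where CustomerId = @CustomerId and OrderDate >= '2016-01-01';
```

Every additional column that a query needs has to be added to the include list, which is how covering indexes become wide. Clustering the table on (CustomerId, OrderDate, OrderId) instead would serve the same query without a second copy of the data, at the cost of a wider row-id in every other nonclustered index.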

Another important factor is how often columns are modified. Adding frequently modified columns to nonclustered indexes requires SQL Server to change data in multiple places, which negatively affects the update performance of the system and increases blocking.

With all that being said, it is not always possible to design clustered indexes that will satisfy all of these guidelines. Moreover, you should not consider these guidelines to be absolute requirements. You should analyze the system, business requirements, workload, and queries and choose clustered indexes that would benefit you, even if they violate some of those guidelines.

Identities, Sequences, and Uniqueidentifiers

People often choose identities, sequences, and uniqueidentifiers as clustered index keys. As always, that approach has its own set of pros and cons.

Clustered indexes defined on such columns are unique, static, and narrow. Moreover, identities and sequences are ever increasing, which reduces index fragmentation. One of the ideal use cases for them is catalog entity tables, such as tables that store lists of customers, articles, or devices. Those tables store thousands, or maybe even a few million, rows, but the data is relatively static, and, as a result, hot spots are not an issue. Moreover, such tables are usually referenced by foreign keys and used in joins. Indexes on int or bigint columns are very compact and efficient, which improves the performance of queries.

Note: We will discuss foreign key constraints in greater detail in Chapter 8, “Constraints.”

Clustered indexes on identity or sequence columns are less efficient in the case of transactional tables, which collect large amounts of data at a very high rate, due to the potential hot spots they introduce.

Uniqueidentifiers, on the other hand, are rarely a good choice for indexes, both clustered and nonclustered. Random values generated with the NEWID() function greatly increase index fragmentation. Moreover, indexes on uniqueidentifiers decrease the performance of batch operations.

Let’s look at an example and create two tables: one with a clustered index on an identity column and one with a clustered index on a uniqueidentifier column. In the next step, we will insert 65,536 rows into both tables. You can see the code for doing this in Listing 7-4.

Listing 7-4. Uniqueidentifiers: Table creation

create table dbo.IdentityCI
(
    ID int not null identity(1,1),
    Val int not null,
    Placeholder char(100) null
);

create unique clustered index IDX_IdentityCI_ID
on dbo.IdentityCI(ID);

create table dbo.UniqueidentifierCI
(
    ID uniqueidentifier not null
        constraint DEF_UniqueidentifierCI_ID
        default newid(),
    Val int not null,
    Placeholder char(100) null
);

create unique clustered index IDX_UniqueidentifierCI_ID
on dbo.UniqueidentifierCI(ID);
go

;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows
,N5(C) as (select 0 from N4 as T1 cross join N4 as T2) -- 65,536 rows
,IDs(ID) as (select row_number() over (order by (select null)) from N5)
insert into dbo.IdentityCI(Val)
    select ID from IDs;

;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows
,N5(C) as (select 0 from N4 as T1 cross join N4 as T2) -- 65,536 rows
,IDs(ID) as (select row_number() over (order by (select null)) from N5)
insert into dbo.UniqueidentifierCI(Val)
    select ID from IDs;

The execution time on my computer and number of reads are shown in Table 7-1.

Table 7-1. Inserting Data into the Tables: Execution Statistics (columns: Number of Reads, Execution Time (ms))

Figure 7-3. Inserting data into the tables: Execution plans

As you can see, there is an additional sort operator in the case of the index on the uniqueidentifier column. SQL Server sorts randomly generated uniqueidentifier values before the insert, which decreases the performance of the query.

Let’s insert another batch of rows into the tables and check index fragmentation. The code for doing this is shown in Listing 7-5. Figure 7-4 shows the results of the queries.

Listing 7-5. Uniqueidentifiers: Inserting rows and checking fragmentation

;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows
,N5(C) as (select 0 from N4 as T1 cross join N4 as T2) -- 65,536 rows
,IDs(ID) as (select row_number() over (order by (select null)) from N5)
insert into dbo.IdentityCI(Val)
    select ID from IDs;

;with N1(C) as (select 0 union all select 0) -- 2 rows
,N2(C) as (select 0 from N1 as T1 cross join N1 as T2) -- 4 rows
,N3(C) as (select 0 from N2 as T1 cross join N2 as T2) -- 16 rows
,N4(C) as (select 0 from N3 as T1 cross join N3 as T2) -- 256 rows
,N5(C) as (select 0 from N4 as T1 cross join N4 as T2) -- 65,536 rows
,IDs(ID) as (select row_number() over (order by (select null)) from N5)
insert into dbo.UniqueidentifierCI(Val)
    select ID from IDs;

select page_count, avg_page_space_used_in_percent, avg_fragmentation_in_percent
from sys.dm_db_index_physical_stats(db_id(), object_id(N'dbo.IdentityCI'), 1, null, 'DETAILED');

select page_count, avg_page_space_used_in_percent, avg_fragmentation_in_percent
from sys.dm_db_index_physical_stats(db_id(), object_id(N'dbo.UniqueidentifierCI'), 1, null, 'DETAILED');

 

 

Figure 7-4. Fragmentation of the indexes

As you can see, the index on the uniqueidentifier column is heavily fragmented, and it uses about 40 percent more data pages as compared to the index on the identity column.

A batch insert into the index on the uniqueidentifier column inserts data at different places in the data file, which leads to heavy, random physical I/O in the case of large tables. This can significantly decrease the performance of the operation.

PERSONAL EXPERIENCE

Some time ago, I was involved in the optimization of a system that had a 250 GB table with one clustered and three nonclustered indexes. One of the nonclustered indexes was the index on the uniqueidentifier column. By removing this index, we were able to speed up a batch insert of 50,000 rows from 45 seconds down to 7 seconds.

There are two common use cases in which you would want to create indexes on uniqueidentifier columns. The first one is for supporting the uniqueness of values across multiple databases. Think about a distributed system where rows can be inserted into every database. Developers often use uniqueidentifiers to make sure that every key value is unique system wide.

The key element in such an implementation is how key values are generated. As you have already seen, the random values generated with the NEWID() function or in the client code negatively affect system performance. However, you can use the NEWSEQUENTIALID() function, which generates unique and generally ever-increasing values (SQL Server resets their base value from time to time). Indexes on uniqueidentifier columns generated with the NEWSEQUENTIALID() function are similar to indexes on identity and sequence columns; however, you should remember that the uniqueidentifier data type uses 16 bytes of storage space, compared to the 4-byte int or 8-byte bigint data types.
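A sketch of this variation is shown below. It mirrors dbo.UniqueidentifierCI from Listing 7-4 but generates the key with NEWSEQUENTIALID(); the table and constraint names are hypothetical. Note that NEWSEQUENTIALID() can be used only in a DEFAULT constraint on a uniqueidentifier column.

```sql
-- Hypothetical counterpart of dbo.UniqueidentifierCI with sequential key generation.
create table dbo.SequentialUniqueidentifierCI
(
    ID uniqueidentifier not null
        constraint DEF_SequentialUniqueidentifierCI_ID
        default newsequentialid(), -- allowed only in a DEFAULT constraint
    Val int not null,
    Placeholder char(100) null
);

create unique clustered index IDX_SequentialUniqueidentifierCI_ID
on dbo.SequentialUniqueidentifierCI(ID);
```

Loading this table with the same insert statements as in Listing 7-4 and rerunning the fragmentation query from Listing 7-5 against it is a straightforward way to compare the two approaches on your own hardware.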

As an alternative solution, you may consider creating a composite index with two columns (InstallationId, Unique_Id_Within_Installation). The combination of these two columns guarantees uniqueness across multiple installations and databases and uses less storage space than uniqueidentifiers do. You can use an integer identity or sequence to generate the Unique_Id_Within_Installation value, which will reduce the fragmentation of the index.
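The composite key could look like the following sketch. The dbo.Devices table, the DeviceName column, and the choice of smallint for InstallationId are assumptions made for illustration; only the two key column names come from the text.

```sql
-- Hypothetical table: each installation stores its own constant InstallationId.
create table dbo.Devices
(
    InstallationId smallint not null,
    Unique_Id_Within_Installation int not null identity(1,1),
    DeviceName varchar(64) not null
);

-- A 6-byte composite key (2 + 4 bytes) instead of a 16-byte uniqueidentifier.
create unique clustered index IDX_Devices_InstallationId_Unique_Id
on dbo.Devices(InstallationId, Unique_Id_Within_Installation);
```

Within a single installation InstallationId never changes, so new rows are still appended at the end of the index, as they would be with a plain identity key.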

In cases where you need to generate unique key values across all entities in the database, you can consider using a single sequence object across all entities. This approach fulfills the requirement but uses a smaller data type than uniqueidentifiers.
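As an illustration, a single sequence can feed the default constraints of several tables. The sequence and table names below are hypothetical, and the cache size is an arbitrary choice.

```sql
-- Hypothetical database-wide key generator.
create sequence dbo.EntityIds as bigint
    start with 1
    increment by 1
    cache 100;

create table dbo.Customers
(
    CustomerId bigint not null
        constraint DEF_Customers_CustomerId
        default (next value for dbo.EntityIds),
    Name varchar(64) not null,
    constraint PK_Customers primary key clustered(CustomerId)
);

create table dbo.Suppliers
(
    SupplierId bigint not null
        constraint DEF_Suppliers_SupplierId
        default (next value for dbo.EntityIds),
    Name varchar(64) not null,
    constraint PK_Suppliers primary key clustered(SupplierId)
);

-- Both rows draw their keys from the same sequence, so the key values
-- do not repeat across the two tables.
insert into dbo.Customers(Name) values('Customer A');
insert into dbo.Suppliers(Name) values('Supplier A');
```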

Another common use case is security, where a uniqueidentifier value is used as a security token or a random object ID. Unfortunately, you cannot use the NEWSEQUENTIALID() function in this scenario, because it is possible to guess the next value returned by that function.

One possible improvement in this scenario is creating a calculated column using the CHECKSUM() function and indexing it, without creating an index on the uniqueidentifier column itself. The code is shown in Listing 7-6.

Listing 7-6. Using CHECKSUM(): Table structure

create table dbo.Articles
(
    ArticleId int not null identity(1,1),
    ExternalId uniqueidentifier not null
        constraint DEF_Articles_ExternalId
        default newid(),
    ExternalIdCheckSum as checksum(ExternalId),
    /* Other Columns */
);

create unique clustered index IDX_Articles_ArticleId
on dbo.Articles(ArticleId);

create nonclustered index IDX_Articles_ExternalIdCheckSum
on dbo.Articles(ExternalIdCheckSum);

Tip: You can index a calculated column without persisting it.

Even though the IDX_Articles_ExternalIdCheckSum index is going to be heavily fragmented, it will be more compact as compared to the index on the uniqueidentifier column (a 4-byte key versus 16 bytes). It also improves the performance of batch operations because of faster sorting, which also requires less memory.

One thing that you must keep in mind is that the result of the CHECKSUM() function is not guaranteed to be unique. You should include both predicates in the queries, as shown in Listing 7-7.

Listing 7-7. Using CHECKSUM(): Selecting data

select ArticleId /* Other Columns */
from dbo.Articles
where checksum(@ExternalId) = ExternalIdCheckSum and ExternalId = @ExternalId;

Tip: You can use the same technique in cases where you need to index string columns larger than the maximum size of a nonclustered index key, which is 900 bytes before SQL Server 2016 and 1,700 bytes starting with SQL Server 2016. Even though such an index would not support range scan operations, it could be used for point lookups.
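Applied to a string column, the technique might look like this sketch. The dbo.WebPages table and its columns are hypothetical; the pattern is the same as in Listings 7-6 and 7-7.

```sql
-- Hypothetical table: Url is too wide to be a nonclustered index key.
create table dbo.WebPages
(
    PageId int not null identity(1,1),
    Url varchar(2000) not null,
    UrlCheckSum as checksum(Url) -- 4-byte calculated column, not persisted
);

create unique clustered index IDX_WebPages_PageId
on dbo.WebPages(PageId);

create nonclustered index IDX_WebPages_UrlCheckSum
on dbo.WebPages(UrlCheckSum);

-- Point lookup: the checksum predicate drives the index seek, and the Url
-- predicate filters out rows whose checksums happen to collide.
declare @Url varchar(2000) = 'https://example.com/some/long/path';

select PageId
from dbo.WebPages
where UrlCheckSum = checksum(@Url) and Url = @Url;
```

It is safest to declare the variable with the same data type as the column, because CHECKSUM() can return different values for the same text stored as varchar and as nvarchar.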