Persistent Data Structures

Reposted from http://www.cnblogs.com/tedzhao

Contents

  • Introduction
  • Persistent Singly Linked Lists
  • Persistent Binary Trees
  • Random Access Lists
  • Immutable Collections
    • Stack
    • SortedList
    • ArrayList
    • Array
    • RandomAccessList
  • Conclusion

Introduction

When you hear the word persistence in programming, most often, you think of an application saving its data to some type of storage, such as a database, so that the data can be retrieved later when the application is run again. There is, however, another meaning for the word persistence when it is used to describe data structures, particularly those used in functional programming languages. In that context, a persistent data structure is a data structure capable of preserving the current version of itself when modified. In essence, a persistent data structure is immutable.

An example of a class that uses this type of persistence in the .NET Framework is the string class. Once a string object is created, it cannot be changed. Any operation that appears to change a string generates a new string instead. Thus, each version of a string object can be preserved. An advantage of a persistent class like the string class is that it effectively gives you undo functionality built in. As newer versions of a persistent object are created, older versions can be pushed onto a stack and popped off when you want to undo an operation. Another advantage is that because persistent data structures cannot change state, they are easier to reason about and are thread safe.
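
The undo idea can be sketched in a few lines. This is a minimal illustration in Python rather than .NET (Python's `str` is likewise immutable); the `history` list and the `edit` and `undo` helpers are hypothetical names introduced only for this example.

```python
# Python strings are immutable, like .NET strings: every "edit" builds a new
# string, so older versions can simply be kept on a stack for undo.
history = []  # stack of earlier versions (hypothetical helper state)


def edit(current, new_value):
    """Save the current version on the stack, then return the new one."""
    history.append(current)
    return new_value


def undo(current):
    """Return the most recently saved version, or the current one if none."""
    return history.pop() if history else current


text = "persistent"
text = edit(text, text.upper())    # a new string; the old one is untouched
text = edit(text, text + " DATA")  # another new version
text = undo(text)                  # back to the upper-case version
text = undo(text)                  # back to the original
```

No copying is needed to save a version: because a version can never change, keeping a reference to it is enough.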

There is an overhead that comes with persistent data structures, however. Each operation that changes a persistent data structure creates a new version of that data structure. This can involve a good deal of copying to create the new version. This cost can be mitigated to a large degree by reusing as much of the internal structure of the old version in creating a new one. I will explore this idea in making two common data structures persistent: the singly linked list and the binary tree, and describe a third data structure that combines the two. I will also describe several classes I have created that are persistent versions of some of the classes in the System.Collections namespace.

Persistent Singly Linked Lists

The singly linked list is one of the most widely used data structures in programming. It consists of a series of nodes linked together one right after the other. Each node has a reference to the node that comes after it, and the last node in the list terminates with a null reference. To traverse a singly linked list, you begin at the head of the list and move from one node to the next until you have reached the node you are looking for or have reached the last node:

[Figure: a singly linked list]

Let's insert a new item into the list. This list is not persistent, meaning that it can be changed in place without generating a new version. After looking at the insertion operation on a non-persistent list, we'll look at the same operation on a persistent list.

Inserting a new item into a singly linked list involves creating a new node:

[Figure: the new node to be inserted]

We will insert the new node at the fourth position in the list. First, we traverse the list until we've reached that position. Then the node that will precede the new node is unlinked from the next node...

[Figure: the preceding node unlinked from its successor]

...and relinked to the new node. The new node is, in turn, linked to the remaining nodes in the list:

[Figure: the new node linked into the list]

Inserting a new item into a persistent singly linked list will not alter the existing list but create a new version with the item inserted into it. Instead of copying the entire list and then inserting the item into the copy, a better strategy is to reuse as much of the old list as possible. Since the nodes themselves are persistent, we don't have to worry about aliasing problems.

To insert a new node at the fourth position, we traverse the list as before, only this time copying each node along the way. Each copied node is linked to the next copied node:

[Figure: the nodes before the insertion point copied and linked together]

The last copied node is linked to the new node, and the new node is linked to the remaining nodes in the old list:

[Figure: the last copied node linked to the new node, which links to the rest of the old list]
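
The insertion just described can be sketched as follows. This is a minimal illustration in Python, not the article's .NET code; `Node`, `insert`, `from_values`, and `to_list` are hypothetical names, and the recursive form is chosen for brevity rather than for long lists.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Node:
    """An immutable list node; frozen=True forbids reassigning its fields."""
    value: Any
    next: Optional["Node"] = None


def insert(head, index, value):
    """Return the head of a new list with value inserted at index (0-based).

    Nodes before index are copied; nodes from index onward are shared.
    """
    if index == 0:
        return Node(value, head)  # the new node links to the shared remainder
    if head is None:
        raise IndexError("index out of range")
    # Copy this node and link the copy to the (copied) rest of the prefix.
    return Node(head.value, insert(head.next, index - 1, value))


def from_values(values):
    """Build a list from a Python sequence (helper for the example)."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head


def to_list(head):
    """Collect the values of a list into a Python list (helper)."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out


old = from_values([1, 2, 3, 4, 5])
new = insert(old, 3, 99)  # insert at the fourth position
```

After the call, `old` still holds its five original items, and the nodes after the insertion point are the very same objects in both lists.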

On average, about N/2 nodes will be copied in the persistent version for insertions and deletions, where N equals the number of nodes in the list. This isn't terribly efficient but does give us some savings. One persistent data structure for which this approach to the singly linked list pays off well is the stack. Imagine the above data structure with insertions and deletions restricted to the head of the list. In this case, all N nodes are reused when an item is pushed onto the stack, and N - 1 nodes are reused when the stack is popped.

Persistent Binary Trees

A binary tree is a collection of nodes in which each node contains two links, one to its left child and another to its right child. Each child is itself a node, and either or both of the child nodes can be null, meaning that a node may have zero to two children. In the binary search tree version, each node usually stores a key/value pair. The tree is searched and ordered according to its keys. The key stored at a node is always greater than the keys stored in its left descendants and always less than the keys stored in its right descendants. This makes searching for any particular key very fast.

Here is an example of a binary search tree. The keys are listed as numbers; the values have been omitted but are assumed to exist. Notice how each key as you descend to the left is less than the key of its parent, and vice versa as you descend to the right:

[Figure: an example binary search tree]

Changing the value of a particular node in a non-persistent tree involves starting at the root of the tree and searching for a particular key associated with that value, and then changing the value once the node has been found. Changing a persistent tree, on the other hand, generates a new version of the tree. We will use the same strategy in implementing a persistent binary tree as we did for the persistent singly linked list, which is to reuse as much of the data structure as possible when making a new version.

Let's change the value stored in the node with the key 7. As the search for the key leads us down the tree, we copy each node along the way. If we descend to the left, we point the previously copied node's left child to the currently copied node. The previous node's right child continues to point to nodes in the older version. If we descend to the right, we do just the opposite.

This illustrates the "spine" of the search down the tree. The red nodes are the only nodes that need to be copied in making a new version of the tree:

[Figure: the search path down the tree to key 7, with the copied nodes shown in red]

You can see that the majority of the nodes do not need to be copied. Assuming the binary tree is balanced, the number of nodes that need to be copied any time a write operation is performed is O(log N), with the logarithm taken base 2. This is much more efficient than the persistent singly linked list.
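
Path copying on a binary search tree can be sketched like this. It is a minimal illustration in Python; `TreeNode` and `set_value` are hypothetical names, and the example tree is made up for the demonstration rather than taken from the figure.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class TreeNode:
    """An immutable binary search tree node."""
    key: int
    value: Any
    left: Optional["TreeNode"] = None
    right: Optional["TreeNode"] = None


def set_value(node, key, value):
    """Return the root of a new tree in which key maps to value.

    Only the nodes on the search path are copied; every subtree hanging
    off that path is shared with the old version.
    """
    if node is None:
        raise KeyError(key)
    if key < node.key:
        # Descend left: copy this node, keep the old right subtree.
        return TreeNode(node.key, node.value,
                        set_value(node.left, key, value), node.right)
    if key > node.key:
        # Descend right: copy this node, keep the old left subtree.
        return TreeNode(node.key, node.value,
                        node.left, set_value(node.right, key, value))
    return TreeNode(node.key, value, node.left, node.right)


# A small example tree (assumed shape, for illustration only).
old = TreeNode(8, "h",
               TreeNode(4, "d",
                        TreeNode(2, "b"),
                        TreeNode(6, "f", TreeNode(5, "e"), TreeNode(7, "g"))),
               TreeNode(12, "l", TreeNode(10, "j"), TreeNode(14, "n")))
new = set_value(old, 7, "G")
```

Here the path 8, 4, 6, 7 is copied, while the subtrees rooted at 12, 2, and 5 are shared between `old` and `new`.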

Insertions and deletions work the same way, except that steps must be taken to keep the tree in balance, such as using an AVL tree. If a binary tree becomes degenerate, we run into the same efficiency problems as we did with the singly linked list.

Random Access Lists

An interesting persistent data structure that combines the singly linked list with the binary tree is Chris Okasaki's random-access list. This data structure allows for random access of its items as well as adding and removing items from the beginning of the list. It is structured as a singly linked list of completely balanced binary trees. The advantage of this data structure is that it allows access, insertion, and removal of the head of the list in O(1) time as well as provides logarithmic performance in randomly accessing its items.

Here is a random-access list with 13 items:

[Figure: a random-access list with 13 items]

When a node is added to the list, the first two root nodes (if they exist) are checked to see if they both have the same height. If so, the new node is made the parent of the first two nodes; the current head of the list is made the left child of the new node, and the second root node is made the right child. If the first two root nodes do not have the same height, the new node is simply placed at the beginning of the list and linked to the next tree in the list.

To remove the head of the list, the root node at the beginning of the list is removed, with its left child becoming the new head and its right child becoming the root of the second tree in the list. The new head of the list is then linked to the next root node in the list:

[Figure: the random-access list after its head has been removed]
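
Adding and removing the head can be sketched as below. This is a minimal illustration in Python using plain tuples; the representation and the names `cons`, `head`, `tail`, and `sizes` are assumptions made for the example. Each list cell is `(size, tree, rest)` and each tree is `(value, left, right)`. The sketch compares tree sizes instead of heights, which is equivalent here because all trees are completely balanced.

```python
def cons(value, ral):
    """Return a new list with value added at the front."""
    if ral is not None and ral[2] is not None and ral[0] == ral[2][0]:
        # The first two trees have equal size: the new node becomes their
        # parent, with the old head as left child and the second tree as right.
        size1, tree1, (size2, tree2, rest) = ral
        return (1 + size1 + size2, (value, tree1, tree2), rest)
    # Otherwise the new node is a one-node tree placed in front of the list.
    return (1, (value, None, None), ral)


def head(ral):
    """Return the first item: the root of the first tree."""
    return ral[1][0]


def tail(ral):
    """Return a new list without the first item."""
    size, (_, left, right), rest = ral
    if size == 1:
        return rest
    half = size // 2
    # The left child becomes the new head; the right child becomes the
    # second tree, which links to the rest of the old list.
    return (half, left, (half, right, rest))


def sizes(ral):
    """Return the sizes of the trees in list order (helper)."""
    out = []
    while ral is not None:
        out.append(ral[0])
        ral = ral[2]
    return out


ral = None
for item in reversed(range(13)):  # build a list holding 0, 1, ..., 12
    ral = cons(item, ral)
```

With 13 items this sketch produces trees of sizes 3, 3, and 7. Neither `cons` nor `tail` modifies its argument, so every earlier version of the list remains valid.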

The algorithm for finding a node at a specific index is in two parts: in the first part, we find the tree in the list that contains the node we're looking for. In the second part, we descend into the tree to find the node itself. The following algorithm is used to find a node in the list at a specific index:

  1. Let I be the index of the node we're looking for. Set T to the head of the list, where T is our reference to the root node of the tree we're currently examining.

  2. If I is equal to 0, we've found the node we're looking for; terminate the algorithm. Else if I is greater than or equal to the number of nodes in T, subtract the number of nodes in T from I, set T to the root of the next tree in the list, and repeat step 2. Else (I is less than the number of nodes in T) go to step 3.

  3. Set S to the number of nodes in T divided by 2 (the fractional part of the division is ignored; for example, if the number of nodes in the current subtree is 3, S will be 1).

  4. If I is less than or equal to S (indices 1 through S lie in the left subtree), subtract 1 from I and set T to T's left child. Else subtract (S + 1) from I and set T to T's right child.

  5. If I is equal to 0, we've found the node we're looking for; terminate the algorithm. Else go to step 3.

This illustrates using the algorithm to find the 10th item in the list:

[Figure: finding the 10th item in the random-access list]
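
The lookup steps can be sketched as follows. This is a minimal illustration in Python using plain tuples; the representation and the names `cons` and `lookup` are assumptions made for the example (list cells are `(size, tree, rest)`, trees are `(value, left, right)`). The comparison in step 4 is written as `i <= s`, since indices 1 through S of a tree fall in its left subtree.

```python
def cons(value, ral):
    """Add value at the front (needed here only to build a sample list)."""
    if ral is not None and ral[2] is not None and ral[0] == ral[2][0]:
        size1, tree1, (size2, tree2, rest) = ral
        return (1 + size1 + size2, (value, tree1, tree2), rest)
    return (1, (value, None, None), ral)


def lookup(ral, i):
    """Return the item at 0-based index i."""
    if i < 0:
        raise IndexError("index out of range")
    # Steps 1-2: skip whole trees until the index falls inside one.
    while ral is not None and i >= ral[0]:
        i -= ral[0]
        ral = ral[2]
    if ral is None:
        raise IndexError("index out of range")
    size, tree, _ = ral
    # Steps 3-5: descend into the tree until the index reaches 0.
    while i != 0:
        s = size // 2           # step 3: S, which is also each child's size
        if i <= s:              # step 4: the node is in the left subtree
            i -= 1
            tree = tree[1]
        else:                   # step 4: the node is in the right subtree
            i -= s + 1
            tree = tree[2]
        size = s
    return tree[0]


ral = None
for item in reversed(range(13)):  # a list holding 0, 1, ..., 12
    ral = cons(item, ral)

tenth = lookup(ral, 9)  # the 10th item has index 9
```

For the 10th item, the two three-node trees are skipped (leaving index 3), the search goes left in the seven-node tree (index 2), then right in the three-node subtree (index 0).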

Keep in mind that operations that change a random-access list do not change the existing list but rather generate a new version representing the change. As much of the old list as possible is reused in creating the new version.

Immutable Collections

Included with this article are a number of persistent collection classes I have created. These classes are in a namespace called ImmutableCollections. I have created persistent versions of some of the collection classes in the System.Collections namespace. I will describe each one and some of the challenges in making them persistent. There are several collection classes that are currently missing; I need to add a queue, for example. Hopefully, I will get to those in time. Also, even though I've taken steps to make these classes efficient, they cannot compete with the System.Collections classes in terms of speed, but they really aren't meant to. They are meant to provide the advantages of immutability while providing reasonable performance.

Stack

This one was easy. Simply create a persistent singly linked list and limit insertions and deletions to the head of the list. Since this class is persistent, popping a stack returns a new version of the stack with the next item in the old stack as the new top. In the System.Collections.Stack version, popping the stack returns the top of the stack. The question for the persistent version was how to make the top of the stack available, since it cannot be returned when the stack is popped. I chose to create a Top property that represents the top of the stack.
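
The shape of such a stack can be sketched as follows. This is a minimal illustration in Python, not the ImmutableCollections code; the class layout and the member names `top`, `push`, `pop`, and `is_empty` are assumptions chosen to mirror the description above.

```python
class Stack:
    """A persistent stack: push and pop return stacks and modify nothing."""

    def __init__(self, top=None, below=None):
        self._top = top      # the item on top (unused for the empty stack)
        self._below = below  # the stack underneath; None marks the empty stack

    @property
    def is_empty(self):
        return self._below is None

    @property
    def top(self):
        """The top item, exposed as a property because pop returns a stack."""
        if self.is_empty:
            raise IndexError("stack is empty")
        return self._top

    def push(self, item):
        """Return a new stack with item on top; all existing nodes are reused."""
        return Stack(item, self)

    def pop(self):
        """Return the stack underneath; no new nodes are created."""
        if self.is_empty:
            raise IndexError("stack is empty")
        return self._below


empty = Stack()
s1 = empty.push("a")
s2 = s1.push("b")
s3 = s2.pop()
```

Popping `s2` yields the very object `s1`, and `s2` itself is still usable afterwards with "b" on top.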

SortedList

The SortedList uses AVL tree algorithms to keep the tree in balance. I found it useful to create an IAvlNode interface. Two classes implement this interface, the AvlNode class and the NullAvlNode class. The NullAvlNode class implements the null object design pattern, standing in for a missing child. This simplified many of the algorithms.
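
The null object idea can be sketched as follows. This is a minimal illustration in Python; only the class names AvlNode and NullAvlNode come from the article, while the members shown (`height`, `count`, `find`, `balance_factor`) and the shared `NULL` instance are assumptions made for the example.

```python
class NullAvlNode:
    """Null object: stands in for a missing child, so callers never test for None."""
    height = 0
    count = 0

    def find(self, key):
        return None  # nothing is stored in an empty subtree


NULL = NullAvlNode()  # a single shared instance is enough, since it is immutable


class AvlNode:
    """A tree node whose children are always nodes (real or null objects)."""

    def __init__(self, key, value, left=NULL, right=NULL):
        self.key = key
        self.value = value
        self.left = left
        self.right = right
        # No None checks needed: the null object answers height and count too.
        self.height = 1 + max(left.height, right.height)
        self.count = 1 + left.count + right.count

    @property
    def balance_factor(self):
        return self.left.height - self.right.height

    def find(self, key):
        if key < self.key:
            return self.left.find(key)
        if key > self.key:
            return self.right.find(key)
        return self.value


tree = AvlNode(2, "b", AvlNode(1, "a"), AvlNode(3, "c"))
```

Searching for a missing key simply runs into the null object, which returns `None` without any special-case code in `AvlNode.find`.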

ArrayList

This is the class that proved most challenging. Like the SortedList, it uses a persistent AVL tree as its data structure. However, unlike the SortedList, items are accessed by index (or by position) rather than by key. I have to admit that the algorithms for accessing and inserting items in a binary tree by index weren't intuitive to me, so I turned to Knuth. Specifically, I used Algorithms B and C in section 6.2.3 in volume 3 of The Art of Computer Programming.
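
Access by position can be sketched as below. This is a minimal illustration in Python of the general idea of storing a node count in each subtree; it is not a transcription of Knuth's algorithms or of the article's ArrayList, and `Node`, `make`, `size`, and `get` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Node:
    """An immutable tree node that records the size of its subtree."""
    value: Any
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    count: int = 1  # number of nodes in the subtree rooted here


def size(node):
    return 0 if node is None else node.count


def make(value, left=None, right=None):
    """Build a node and compute its subtree count from its children."""
    return Node(value, left, right, 1 + size(left) + size(right))


def get(node, index):
    """Return the item at the given 0-based position in in-order sequence."""
    while node is not None:
        left_size = size(node.left)
        if index < left_size:
            node = node.left          # the item is in the left subtree
        elif index == left_size:
            return node.value         # this node is exactly at that position
        else:
            index -= left_size + 1    # skip the left subtree and this node
            node = node.right
    raise IndexError("index out of range")


# In-order sequence of this example tree: a, b, c, d, e, f
tree = make("c",
            make("b", make("a")),
            make("e", make("d"), make("f")))
```

Because each node knows how many items lie to its left, a lookup visits one node per level, so it is logarithmic when the tree is balanced.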

I made an assumption about the ArrayList in order to improve performance. I assumed that the Add method is by far the most used method. However, adding items to the ArrayList one right after the other causes a lot of tree rotations to keep the tree in balance. To solve this, I created a template tree that is already completely balanced. Since this template tree is immutable, it can exist at the class level and be shared amongst all of the instances of the class.

When an instance of the ArrayList class is created, it takes a small subtree of the template tree. As items are added, the nodes in the template tree are replaced with new nodes. Since the tree is completely balanced, no rebalancing is necessary. If the subtree gets filled up, another subtree of equal height is taken from the template tree and joined to the existing tree. Insertions and deletions are handled normally with rebalancing performed if necessary. Again, the assumption is that adding items to the ArrayList occurs much more frequently than inserting or deleting items.

Array

The Array class uses the random access list structure to provide a persistent array with logarithmic performance. Unlike a random access list, it has a fixed size.

RandomAccessList

This class does not have a parallel in the System.Collections namespace, but it was one of the first persistent classes I wrote, and I decided to include it here. It's a straightforward implementation of Chris Okasaki's random-access list described above. This data structure was designed to be used in functional languages where lists have three basic operations: Cons, Head, and Tail. Cons adds an item to the head of the list, Head is the first item in the list, and Tail represents all of the items in the list except for the Head.

Conclusion

Persistent data structures help simplify programming by eliminating a whole class of bugs associated with side-effects and synchronization issues. They are not a cure-all but are a useful tool for helping a programmer deal with complexity. I have explored ways of making data structures persistent and have provided a small .NET library of persistent data structures. I hope you have enjoyed the article, and as always, I welcome feedback.

Update (2008-11-13):
Original article: http://www.codeproject.com/KB/recipes/persistentdatastructures.aspx

The accompanying code can be downloaded from the original article.