
CVPR2018 Paper Notes (IV): PPFNet, Part 4

This week I mainly read Section 5, Results.

5. Results

(1)Setup

Our input encoding uses a 17-point neighborhood to compute the normals for the entire scene, using well-accepted plane fitting [18]. For each fragment, we anchor 2048 sample points distributed with spatial uniformity. These sample points act as keypoints and, within their 30 cm vicinity, they form the patch from which we compute the local PPF encoding. Similarly, we down-sample the points within each patch to 1024 to facilitate training as well as to increase the robustness of the features to varying point density and missing parts. For occasional patches with insufficient points in the defined neighborhood, we randomly repeat points to ensure identical patch size. PPFNet extracts compact descriptors of dimension 64.
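
To make the "local PPF encoding" concrete for myself, here is a minimal NumPy sketch of the standard four-component point pair feature (distance plus three angles), paired between a patch's keypoint and each neighbor as I understand the paper's setup. The helper names are mine, not from the authors' code.

```python
import numpy as np

def angle(v1, v2):
    """Unsigned angle between two 3-D vectors."""
    return np.arctan2(np.linalg.norm(np.cross(v1, v2)), np.dot(v1, v2))

def point_pair_feature(p1, n1, p2, n2):
    """Standard 4-D PPF: (||d||, angle(n1, d), angle(n2, d), angle(n1, n2))."""
    d = p2 - p1
    return np.array([np.linalg.norm(d), angle(n1, d), angle(n2, d), angle(n1, n2)])

def encode_patch(keypoint, keypoint_normal, neighbors, neighbor_normals):
    """Stack the PPFs between a patch's keypoint and each of its (e.g. 1024) neighbors."""
    return np.stack([point_pair_feature(keypoint, keypoint_normal, p, n)
                     for p, n in zip(neighbors, neighbor_normals)])
```

In the paper's setting each patch would contribute 1024 such rows (with points repeated when the neighborhood is too sparse), alongside the raw points and normals.
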
PPFNet is implemented in the popular TensorFlow [1]. The initialization uses random weights and the ADAM [25] optimizer minimizes the loss. Our network operates simultaneously on all 2048 patches. The learning rate is set to 0.001 and exponentially decayed after every 10 epochs down to 0.00001. Due to hardware constraints, we use a batch size of 2 fragment pairs per iteration, which already contains 8192 local patches from 4 fragments. This generates 2 × 2048² combinations for the network per batch.
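
My reading of the schedule above ("exponentially decayed after every 10 epochs until 0.00001") as a tiny sketch; the decay factor of 0.5 is an assumption I chose for illustration, not a value given in the paper.

```python
def learning_rate(epoch, base_lr=1e-3, floor=1e-5, decay=0.5, step=10):
    """Step-wise exponential decay: multiply by `decay` every `step` epochs, clipped at `floor`."""
    return max(base_lr * (decay ** (epoch // step)), floor)

print([learning_rate(e) for e in (0, 10, 30, 70, 100)])
# [0.001, 0.0005, 0.000125, 1e-05, 1e-05] -- decays until it hits the stated floor
```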

(2)Real Datasets

We concentrate on real sets rather than synthetic ones, and therefore our evaluations are against the diverse 3DMatch RGBD benchmark [48], in which 62 different real-world scenes are retrieved from the pool of datasets Analysis-by-Synthesis [42], 7-Scenes [38], SUN3D [46], RGB-D Scenes v.2 [27] and Halber et al. [15]. This collection is split into 2 subsets, 54 for training and validation, 8 for testing. The dataset typically includes indoor scenes like living rooms, offices, bedrooms, tabletops, and restrooms. See [48] for details. As our input consists of only point geometry, we solely use the fragment reconstructions captured by the Kinect sensor and not the color.

(3)Can PPFNet outperform the baselines on real data?

We evaluate our method against the hand-crafted baselines of Spin Images [21], SHOT [37], FPFH [34] and USC [41], as well as 3DMatch [48], the state-of-the-art deep-learning-based 3D local feature descriptor, the vanilla PointNet [30], and CGF [23], a hybrid hand-crafted and deep descriptor designed for compactness. To make the experiments fairer, we also show a version of 3DMatch, denoted 3DMatch-2K, in which we use 2048 local patches per fragment instead of 5K, the same as in our method. We use the provided pretrained weights of CGF [23]. We keep the local patch size the same for all methods. Our evaluation data consists of fragments from the 7-Scenes [38] and SUN3D [46] datasets. We begin by showing comparisons without applying RANSAC to prune the false matches. We believe that this shows the true quality of the correspondence estimator. Inspired by [23], we regard recall as a more effective measure for this experiment, as the precision can always be improved by better correspondence pruning [5, 4]. Our evaluation metric directly computes the recall by averaging the number of matched fragments across the datasets:
(Recall formula from the paper.)
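
Since the image of the equation is not reproduced here, the following is my own reconstruction in LaTeX, based purely on the description below (the symbols M, Ω, x, y, T, τ1, τ2 are the paper's; the exact normalization is as I read it, so treat this as a sketch rather than a verbatim copy):

```latex
R \;=\; \frac{1}{M} \sum_{s=1}^{M}
  \mathbb{1}\!\left(
    \underbrace{\frac{1}{|\Omega_s|} \sum_{(i,j)\in\Omega_s}
      \mathbb{1}\big(\lVert \mathbf{x}_i - \mathbf{T}\,\mathbf{y}_j \rVert < \tau_1\big)}_{\text{inlier ratio of fragment pair } s}
    \;>\; \tau_2
  \right)
```
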
where M is the number of ground-truth matching fragment pairs, i.e. pairs having at least 30% overlap with each other under the ground-truth transformation T, and τ1 = 10 cm. (i, j) denotes an element of the found correspondence set Ω. x and y respectively come from the first and second fragment under matching. The inlier-ratio threshold is set to τ2 = 0.05. As seen from Tab. 1, PPFNet outperforms all the hand-crafted counterparts in mean recall. It also shows a consistent advantage over 3DMatch-2K, which uses an equal number of patches. Finally and remarkably, we are able to show a ∼2.7% improvement in mean recall over the original 3DMatch, using only ∼40% of the keypoints for matching. The performance boost from 3DMatch-2K to 3DMatch also indicates that having more keypoints is advantageous for matching.
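
As a complement, here is a small NumPy sketch of how this recall could be evaluated for one dataset, given precomputed correspondence sets and ground-truth poses; the function names and data layout are my own, for illustration only.

```python
import numpy as np

def fragment_pair_matched(x, y, correspondences, T, tau1=0.10, tau2=0.05):
    """x, y: (N,3)/(M,3) points of the two fragments; correspondences: (K,2) index
    pairs (i, j) found by descriptor matching; T: 4x4 ground-truth pose of fragment y.
    Returns True if the inlier ratio under T exceeds tau2."""
    y_h = np.c_[y, np.ones(len(y))]          # homogeneous coordinates
    y_in_x = (T @ y_h.T).T[:, :3]            # bring fragment 2 into fragment 1's frame
    i, j = correspondences[:, 0], correspondences[:, 1]
    dists = np.linalg.norm(x[i] - y_in_x[j], axis=1)
    return np.mean(dists < tau1) > tau2

def recall(pairs):
    """pairs: iterable of (x, y, correspondences, T) for the M ground-truth pairs."""
    return np.mean([fragment_pair_matched(*p) for p in pairs])
```
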
(Image from the paper.)
Table 1. Our evaluation on the 3DMatch benchmark before RANSAC.
[RANSAC] An introduction to the RANSAC algorithm: https://www.cnblogs.com/weizc/p/5257496.html

Our method expectedly outperforms both the vanilla PointNet and CGF by 15%. We show in Tab. 2 that adding more samples brings benefit, but only up to a certain level (< 5K). For PPFNet, adding more samples also increases the global context, and thus, as hardware advances, we have the potential to further widen the performance gap over 3DMatch by simply using more local patches. To show that we do not cherry-pick τ2 but obtain consistent gains, we also plot the recall computed with the same metric for different inlier ratios in Fig. 6(a). There, for the practical choices of τ2, PPFNet persistently remains above all others.
(Table from the paper.)
The more samples, the higher the recall.
(Figure from the paper.)

(4)Application to geometric registration

Similar to [48], we now use PPFNet in the broader context of transformation estimation. To do so, we plug all descriptors into the well-established RANSAC-based matching pipeline, in which the transformation between fragments is estimated by running a maximum of 50,000 RANSAC iterations on the initial correspondence set. We then transform the source cloud to the target by the estimated 3D pose and compute the point-to-point error. This is a well-established error metric [48]. Tab. 3 tabulates the results on the real datasets. Overall, PPFNet is again the top performer, showing higher recall on a majority of the scenes and on average. It is noteworthy that we always use 2048 patches, while allowing 3DMatch to use its original setting of 5K. Even so, we obtain better recall on more than half of the scenes. When we feed 3DMatch 2048 patches, to be on par with our sampling level, PPFNet dominates performance-wise on most scenes with higher average accuracy.
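
To make the registration pipeline concrete for myself, here is a compact, self-contained sketch of correspondence-based RANSAC with a Kabsch (SVD) pose estimate. This is my illustration of the generic procedure, not the authors' implementation; the inlier threshold and seed are assumptions.

```python
import numpy as np

def kabsch(src, dst):
    """Best rigid transform (R, t) aligning src -> dst (both (N,3)), via SVD."""
    cs, cd = src.mean(0), dst.mean(0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    return R, cd - R @ cs

def ransac_registration(src_pts, dst_pts, corr, n_iter=50_000, inlier_thresh=0.10, seed=0):
    """corr: (K,2) putative correspondences (K >= 3) from descriptor matching.
    Returns the (R, t) hypothesis with the largest inlier set over the correspondences."""
    rng = np.random.default_rng(seed)
    best, best_inliers = (np.eye(3), np.zeros(3)), -1
    for _ in range(n_iter):
        sample = corr[rng.choice(len(corr), size=3, replace=False)]
        R, t = kabsch(src_pts[sample[:, 0]], dst_pts[sample[:, 1]])
        d = np.linalg.norm((src_pts[corr[:, 0]] @ R.T + t) - dst_pts[corr[:, 1]], axis=1)
        n_in = int((d < inlier_thresh).sum())
        if n_in > best_inliers:
            best, best_inliers = (R, t), n_in
    return best, best_inliers
```

In practice one would refine the winning hypothesis on all of its inliers before applying the estimated pose and computing the point-to-point error, as the paper does.
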
(Table from the paper.)

(5)Robustness to point density

Change in point density, a.k.a. sparsity, is an important concern for point clouds, as density can vary with sensor resolution or scanning distance for 3D scanners. This motivates us to evaluate our algorithm against the others under varying sparsity levels. We gradually decrease the point density of the evaluation data and record the accuracy. Fig. 6(b) shows the significant advantage of PPFNet, especially under severe loss of density (only 6.5% of the points kept). Such robustness is achieved thanks to the PointNet backend and the robust point pair features.
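
A trivial way to reproduce this kind of sparsity sweep is to randomly keep only a fraction of each fragment's points before descriptor extraction; a helper like the one below (my own, not from the paper) is enough.

```python
import numpy as np

def random_downsample(points, keep_ratio, seed=0):
    """Keep a random `keep_ratio` fraction of an (N,3) cloud, e.g. 0.065 for 6.5%."""
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(len(points) * keep_ratio))
    return points[rng.choice(len(points), size=n_keep, replace=False)]
```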

(6)How fast is PPFNet?

We found PPFNet to be lightning fast in inference and very quick in data preparation, since we consume a very raw representation of the data. The majority of our runtime is spent on the normal computation, and this is done only once for the whole fragment. The PPF extraction is carried out within the neighborhoods of only 2048 sample points. Tab. 4 shows the average running times of the different methods and ours on an NVIDIA TitanX Pascal GPU supported by an Intel Core i7 3.2 GHz 8-core CPU. Such dramatic speed-up in inference is enabled by the parallel PointNet backend and our simultaneous correspondence estimation during inference for all patches. Currently, to prepare the input for the network, we only use the CPU, leaving the GPU idle for other work. This part can easily be implemented on the GPU to gain even further speed boosts.

5.1. Ablation Study

(1)N-tuple loss

We train and test our network with 3 different losses: contrastive (pair) [14], triplet [17], and our N-tuple loss, on the same dataset with identical network configuration. The distance distributions of corresponding and non-corresponding pairs are recorded for the training/validation data respectively. Empirical results in Fig. 7 show that the theoretical advantage of our loss immediately transfers to practice: features learned by the N-tuple loss are better separable, i.e. non-pairs are more distant in the embedding space and pairs enjoy a lower standard deviation. The N-tuple loss repels non-pairs further in comparison to the contrastive and triplet losses because of its better knowledge of global correspondence relationships. Our N-tuple loss is general, and thus we strongly encourage its application to other domains as well, such as pose estimation [44].
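
The paper describes the N-tuple loss as a generalization of the contrastive loss that exploits all pairwise correspondence relationships inside a batch at once. The sketch below is only my reading of that idea: a binary correspondence matrix gates which pairwise feature distances are pulled together and which are pushed beyond a margin. The margin, weighting and normalization here are assumptions, not the paper's exact formulation.

```python
import numpy as np

def n_tuple_loss(features, corr_matrix, margin=1.0, alpha=1.0):
    """features: (N, d) descriptors of the N patches in a batch.
    corr_matrix: (N, N) binary matrix, 1 where two patches truly correspond.
    Pulls corresponding pairs together and pushes all other pairs beyond `margin`."""
    diff = features[:, None, :] - features[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                       # (N, N) pairwise distances
    pos = corr_matrix * dist                                   # attract true pairs
    neg = (1 - corr_matrix) * np.maximum(margin - dist, 0.0)   # repel non-pairs
    return pos.sum() / max(corr_matrix.sum(), 1) + alpha * neg.sum() / max((1 - corr_matrix).sum(), 1)
```

In a training framework this would of course be written with differentiable ops (e.g. TensorFlow), but the structure is the same.
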
(Image from the paper.)

(2)How useful is global context for local feature extraction?

We argue that local features are dependent on the context. A corner belonging to a dining table should not share similar local features with a picture frame hanging on the wall, since a table is generally not supposed to be attached vertically to the wall. To assess the returns obtained from adding global context, we simply remove the global feature concatenation, keep the rest of the settings unaltered, and re-train and test on two subsets of fragment pairs. Our results are shown in Tab. 5, where injecting global information into local features improves the matching by 18% on the training set and 7% on the validation set, as opposed to our baseline version of vanilla PointNet, which is free of global context and PPFs. Such significance indicates that global features aid discrimination and are valid cues also for local descriptors.
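
The "global feature concatenation" being ablated here is, as I understand the paper, a PointNet-style step: pool all local patch features into one global vector and append it to every local feature before the final layers. A minimal sketch of that wiring (shapes and names are my assumptions):

```python
import numpy as np

def concat_global_context(local_feats):
    """local_feats: (num_patches, d) per-patch features from the local PointNets.
    Max-pool over patches to get a global context vector, then append it to
    every local feature, doubling the feature dimension."""
    global_feat = local_feats.max(axis=0)                       # (d,)
    tiled = np.broadcast_to(global_feat, local_feats.shape)     # (num_patches, d)
    return np.concatenate([local_feats, tiled], axis=1)         # (num_patches, 2d)
```
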
(Image from the paper.)

(3)What does adding PPF bring?

We now run a similar experiment and train two versions of our network, with and without incorporating PPF into the input. The contribution is tabulated in Tab. 5. There, a gain of 1% in training and 5% in validation is achieved, justifying that the inclusion of PPF increases the discriminative power of the final features. While being a significant jump, this is not the only benefit of adding PPF. Note that our input representation is composed of 33% rotation-invariant and 66% variant representations. This is already advantageous compared to the state of the art, where rotation handling is completely left to the network to learn from data. We hypothesize that the input guidance of PPF aids the network in being more tolerant to rigid transformations. To test this, we gradually rotate fragments around the z-axis up to 180° with a step size of 30° and then match each rotated fragment to the non-rotated one.
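
The rotation sweep itself is easy to reproduce; a small helper like the one below (my own, for illustration) rotates a fragment about the z-axis by a given angle before re-extracting descriptors.

```python
import numpy as np

def rotate_z(points, degrees):
    """Rotate an (N, 3) point cloud about the z-axis by `degrees`."""
    a = np.deg2rad(degrees)
    R = np.array([[np.cos(a), -np.sin(a), 0.0],
                  [np.sin(a),  np.cos(a), 0.0],
                  [0.0,        0.0,       1.0]])
    return points @ R.T

# sweep used in the experiment: 30, 60, ..., 180 degrees
angles = range(30, 181, 30)
```
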
(Image from the paper.)
As we can observe from Tab. 6, with PPFs the features are more robust to rotation, and the gap in matching performance between the two networks widens as rotation increases. Accordingly, we also show a visualization of the descriptors under small and large rotations in Fig. 9. To assign each descriptor an RGB color, we use a PCA projection from the high-dimensional feature space to the 3D color space, obtained by learning a linear map [23]. It is qualitatively apparent that PPF can strengthen the robustness towards rotations. All in all, with PPFs we gain both accuracy and robustness to rigid transformation, the best of seemingly contradicting worlds. It is noteworthy that using only PPF introduces full invariance besides the invariance to permutations, and renders the task very difficult to learn for our current network. We leave this as a future challenge. A major limitation of PPFNet is its quadratic memory footprint, limiting the number of used patches to 2K on our hardware. This is, for instance, why we cannot outperform 3DMatch on the fragments of Home-2. With upcoming GPUs, we expect to reach beyond 5K, the point of saturation.
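
The descriptor-coloring trick mentioned above (project the 64-D descriptors to 3 dimensions and map them to RGB) can be sketched as follows; this is a generic PCA-based recoloring with names of my own, not necessarily the exact linear map of [23].

```python
import numpy as np

def descriptors_to_rgb(descriptors):
    """descriptors: (N, 64). Learn a linear map (PCA) to 3-D and rescale to [0, 1]
    so every keypoint can be painted with an RGB color for visualization."""
    centered = descriptors - descriptors.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ Vt[:3].T                    # (N, 3) top-3 principal components
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return (proj - lo) / np.maximum(hi - lo, 1e-12)
```
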
(Image from the paper.)
Figure 9. As shown in the figure, including PPF makes the network more robust to rotation changes; for a fully invariant feature, the appearance of each row would be expected to stay consistent.
(Image from the paper.)

6. Conclusion

We have presented PPFNet, a new 3D descriptor tailored for point cloud input. By generalizing the contrastive loss to the N-tuple loss to fully utilize the available correspondence relationships, and by retargeting the training pipeline, we have shown how to learn a globally aware 3D descriptor which outperforms the state of the art not only in terms of recall but also speed. Features learned by our PPFNet are more capable of dealing with some challenging scenarios, as shown in Fig. 8. Furthermore, we have shown that designing our network to be suitable for set input, such as point pair features, is advantageous in developing invariance properties. Future work will target the memory bottleneck and solving the more general rigid graph matching problem.

The parts marked in red are the places I am still not fully sure about at this stage, so that they can be revised later.