
Reading Notes: The Random Forest based Detection of Shadowsock's Traffic


Paper structure:

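Aside from the introduction and related work, the paper opens with background knowledge: 1. Why is Shadowsocks hard to detect? 2. An introduction to the random forest algorithm (used here as a classifier that splits traffic into two classes: Shadowsocks traffic and non-Shadowsocks traffic). 3. Definitions related to the data being sent and received.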

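Next comes the authors' proposed method. A training set is obtained first, and CART trees are then used to classify Shadowsocks traffic. At each node, f features are drawn from the F available features; among these f, the feature k that gives the best classification is selected, and the threshold at which it splits best is th. Samples whose value on the current node's feature is smaller than the threshold are sent to the left child node. At prediction time, a sample starts from the root of the current CART tree: if its value is smaller than the current node's threshold it goes to the left child, otherwise to the right child.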
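The split-and-descend procedure described above can be sketched in a few lines of Python. This is only an illustration of the general CART idea, not the authors' implementation: the `Node` class, the `best_split` and `predict` helpers, and the toy data are all hypothetical names introduced here, and Gini impurity is assumed as the split score because the experiment section sets the criterion to 'gini'.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """One CART node: an internal split, or a leaf carrying a label."""
    feature: Optional[int] = None      # index k of the feature tested here
    threshold: Optional[float] = None  # split threshold th
    left: Optional["Node"] = None      # samples with x[k] < th
    right: Optional["Node"] = None     # samples with x[k] >= th
    label: Optional[str] = None        # set only on leaves


def gini(labels: Sequence[str]) -> float:
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(X, y, f, rng):
    """Draw f of the F features at random, then return the (k, th) pair
    with the lowest weighted Gini impurity among those candidates."""
    F = len(X[0])
    best_k, best_th, best_score = None, None, float("inf")
    for k in rng.sample(range(F), f):
        for th in sorted({row[k] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[k] < th]
            right = [lab for row, lab in zip(X, y) if row[k] >= th]
            if not left or not right:
                continue  # a split must put samples on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best_k, best_th, best_score = k, th, score
    return best_k, best_th


def predict(node: Node, x: Sequence[float]) -> str:
    """Walk from the root: below the threshold goes left, otherwise right."""
    while node.label is None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.label


# Toy data (made up): feature 0 separates the classes, feature 1 does not.
X = [[1.0, 10.0], [2.0, 20.0], [8.0, 10.0], [9.0, 20.0]]
y = ["No", "No", "Yes", "Yes"]
k, th = best_split(X, y, f=2, rng=random.Random(0))
root = Node(feature=k, threshold=th,
            left=Node(label="No"), right=Node(label="Yes"))
print(k, th, predict(root, [3.0, 15.0]), predict(root, [8.5, 15.0]))
```

A real random forest repeats `best_split` recursively on each child and trains many such trees on bootstrap samples; the sketch stops at a single split to keep the left/right rule visible.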

How are the feature dimensions chosen? As follows:

Feature engineering in the paper:

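Based on the hostProfile and biflow attributes of network packets, we propose several features. We then capture a large amount of Shadowsocks traffic, extract certain feature values, and save them as the training set. Some of the features are listed in Table 1. In addition, we have a 3000-dimensional vector that records which upstream and downstream packet sizes appeared during the whole communication.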
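One way such a packet-size presence vector might be built is sketched below. The notes do not say how the 3000 dimensions are laid out, so the layout here is an assumption: 1500 slots for upstream sizes of 1 to 1500 bytes followed by 1500 slots for downstream sizes. The function name `size_presence_vector`, the `"up"`/`"down"` direction tags, and the sample flow are hypothetical.

```python
from typing import Iterable, List, Tuple

MAX_SIZE = 1500  # assumed maximum packet size in bytes (not stated in the notes)


def size_presence_vector(packets: Iterable[Tuple[str, int]]) -> List[int]:
    """Build a binary vector of length 2 * MAX_SIZE.

    Positions 0 .. MAX_SIZE-1 mark upstream packet sizes 1 .. MAX_SIZE;
    positions MAX_SIZE .. 2*MAX_SIZE-1 mark downstream sizes 1 .. MAX_SIZE.
    A slot is 1 if a packet of that size and direction was seen at least once.
    """
    vec = [0] * (2 * MAX_SIZE)
    for direction, size in packets:
        if not 1 <= size <= MAX_SIZE:
            continue  # ignore sizes outside the assumed range
        if direction == "up":
            offset = 0
        elif direction == "down":
            offset = MAX_SIZE
        else:
            raise ValueError(f"unknown direction: {direction!r}")
        vec[offset + size - 1] = 1
    return vec


# Made-up flow: (direction, packet size in bytes)
flow = [("up", 64), ("down", 1500), ("up", 64), ("down", 517)]
vec = size_presence_vector(flow)
print(len(vec), sum(vec))
```

Because the vector only records presence, repeated packets of the same size and direction (the two 64-byte upstream packets above) set a single slot.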


Experimental steps and model parameter choices:

The steps of the experiments:

  1. Capture pure Shadowsocks traffic, process it, then extract and save the selected features.

  2. Use Random Forest to model these values. In the Random Forest algorithm, the number of CART trees is set to 100, the split criterion to 'gini', and the number of features drawn per split to sqrt(C), where C is the total number of feature dimensions. The maximum tree depth is set to None, so trees grow until all the nodes are identified. The classification results carry two labels, "Yes" and "No". The remaining parameters keep the defaults of Python's RandomForestClassifier function.

  3. Capture detection traffic, including both Shadowsocks and non-Shadowsocks traffic. Extract the same feature values and save them. Finally, use the Random Forest algorithm to build the models and predict.
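The parameter choices listed in these steps map directly onto scikit-learn's `RandomForestClassifier`. The sketch below assumes scikit-learn and NumPy are installed and uses randomly generated vectors as a stand-in for the extracted traffic features; the data, the feature count of 16, and `random_state` are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for extracted feature vectors (NOT real traffic data).
n_features = 16
X_ss = rng.normal(loc=1.0, scale=1.0, size=(200, n_features))      # "Yes" class
X_other = rng.normal(loc=-1.0, scale=1.0, size=(200, n_features))  # "No" class
X = np.vstack([X_ss, X_other])
y = np.array(["Yes"] * 200 + ["No"] * 200)

clf = RandomForestClassifier(
    n_estimators=100,     # 100 CART trees
    criterion="gini",     # split quality measured by Gini impurity
    max_features="sqrt",  # sqrt(C) candidate features per split
    max_depth=None,       # grow until leaves are pure or too small to split
    random_state=0,       # added here only for reproducibility
)
clf.fit(X, y)

# Predict on a few new synthetic samples; each result is "Yes" or "No".
X_new = rng.normal(loc=1.0, scale=1.0, size=(5, n_features))
pred = clf.predict(X_new)
print(pred)
```

In a real experiment, `X` would hold the feature values saved in step 1 and `X_new` the features extracted from the detection traffic captured in step 3.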

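Novelty:

Few studies have applied machine learning to traffic detection, and compared with earlier manual identification this approach improves efficiency. The paper describes its approach as a semi-supervised machine learning algorithm. The detection accuracy exceeds 85%.

Weaknesses:

  1. There are too many features. They raise accuracy, but they also increase the computational burden and leave the model relatively redundant.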
